Preface


In 2019, the annual joint workshop of the Fraunhofer Institute of Optronics,
System Technologies and Image Exploitation (IOSB) and the Vision and Fusion
Laboratory (IES) of the Institute for Anthropomatics, Karlsruhe Institute of
Technology (KIT) was again hosted at the Griesgethof near the town of
Triberg-Nussbach in Germany.


For a week, from July 29 to August 2, the PhD students of both institutions
delivered extended reports on the status of their research and participated
in thorough discussions on topics ranging from computer vision and optical
metrology to usage control and neural networks. Most results and ideas presented
at the workshop are collected in this book in the form of detailed technical
reports. This volume provides a comprehensive and up-to-date overview of
the research program of the IES Laboratory and the Fraunhofer IOSB. Special
thanks go to Prof. Dr. Stephan Klaus from the Mathematical Research Institute
of Oberwolfach for giving us a very inspiring tour through the MiMa, Museum
for Minerals and Mathematics, on the excursion day of the workshop.


The editors thank Julius Krause, Florian Becker, Arno Appenzeller, Paul Wagner 
and other organizers for their efforts resulting in a pleasant and inspiring 
atmosphere throughout the week. We would also like to thank the doctoral 
students for writing and reviewing the technical reports as well as for responding 
to the comments and the suggestions of their colleagues. 


Prof. Dr.-Ing. habil. Jürgen Beyerer 
Dr. Tim Zander 

Towards a Privacy Compliant Research Interface for
Multicenter Medical Data
Technical Report IES-2019-06


Abstract 


Big Data analysis is attracting growing interest in the processing of e-Health
data. The potentially large benefit of such analyses comes with a set of new,
unknown impacts on an individual’s privacy. It is therefore important to find a
balance between privacy impact and utility of the medical data analysis. To this
end, this technical report takes a look at different privacy preserving
techniques that could be used for a privacy preserving research interface for
medical data. The three techniques Differential Privacy, k-Anonymity and Secure
multi-party Computation are evaluated for their feasibility in a medical use
case. Based on these preliminaries, some formal definitions are made for a
privacy preserving research interface which implements a hybrid approach of the
three techniques and a consent based interface.


1 Introduction 


Digitization in the health care sector is gaining more and more traction. As a
consequence, more e-Health data than ever before is accessible for a broad range
of use cases. As the amount of data on a given topic grows, Big Data research
usually becomes interested in that topic. Especially for medical data, Big Data
promises new therapies and valuable new insights into different diseases [12].
A more or less open question from a technical perspective is data protection
for medical data. From the legal perspective, for example in the European
General Data Protection Regulation (GDPR), there is a firm position on the
privacy of medical data. However, there are many open questions when processing
large amounts of medical data. In general, the GDPR categorizes personal health
information as a special category of data. Article 9 Paragraph 1 says:
"Processing of [...] data concerning health or data concerning a natural
person’s sex life or sexual orientation shall be prohibited" [3]. At first this
means that the processing of personal health data is not allowed. But Article 9
Paragraph 2 a) to j) lists exceptions which allow the processing of this special
category of data. One of these exceptions applies if the affected person
consents to the usage of their data. Other grounds that allow the processing,
like processing in the public interest, are more ambiguous than explicit
consent.
While the GDPR asks for explicit permission from the affected person for the use
of the data, even processing a large amount of anonymized data does not
guarantee privacy. A recent study showed that the combination of 15 different
attributes per dataset is enough to identify an exact person in the US [10].
This shows that even if data is only processed in anonymized form, additional
measures have to be taken if an affected individual does not explicitly consent
to a certain risk of re-identification.

Another fact we face when working with medical data is that the data
environments are often multi-centric. This means that the data of a single
patient is split across different clinicians or hospitals. As a consequence,
data from multiple sites needs to be coordinated, which in most cases means that
a trusted party is needed as a broker for the data. Furthermore, the privacy of
the data is an important question when the data comes from different sources and
is potentially used at different sites for different purposes. Besides the
challenge of a research interface for multi-centric health data, there are other
challenges, like how to merge the data of a single patient from different sites
or how the different data providers can be connected securely. However, in this
technical report we focus on a potential research interface for multicenter
medical data. A main requirement for this is the privacy compliant processing of
the personal health information. While maintaining this and providing
anonymized/pseudonymized data when needed, another important aspect is to
provide a back channel for potential results obtained from the processed data,
especially if they are important for an individual.

In this technical report we will have an in-depth look at various techniques to
provide privacy for personal data in big datasets while still retaining maximum
data precision. Afterwards we will present a concept that combines those
mechanisms with additional techniques that consider consent, in order to provide
a privacy preserving research interface for multicenter medical data. Finally,
we draw a conclusion and provide an outlook.


2 Related work 


As mentioned in the introduction, a recent study by Rocher et al. showed that 15
different attributes are enough to identify 99.8% of the citizens of
Massachusetts [10]. The claim is supported by a statistical model. This applies
regardless of how incomplete the data is, so anonymization alone will not
provide enough benefit to protect an individual’s privacy. Even a training set
for a machine learning algorithm can therefore be a privacy risk. Because of
this conclusion the authors call for even stricter measures than, for example,
the GDPR demands to protect the privacy of individuals.

The project "PAPAYA: A Platform for Privacy Preserving Data Analytics"
focuses more on the specific issue of a privacy preserving research interface for
medical data [2]. Ciceri et al. introduce a project to create privacy-preserving
neural networks. The approach uses a combination of encryption, secure
multi-party computation, differential privacy and functional encryption.
Different data sources are used to train a neural network. The training data is
discarded afterwards. All in all, they do not provide an in-depth look at their
approach, but they present the idea of using differential privacy to add noise
to the original training data.

Another project that provides a research interface for medical data is the MOSAIC
project [1]. Bialke et al. describe this in "MOSAIC - A Modular Approach to
Data Management in Epidemiological Studies". The authors want to comply
with privacy requirements by using study-specific pseudonymisation and giving
third parties access only through a designated interface. An interesting
fact about MOSAIC is that it enables a designated back channel for research
algorithms. With this, an algorithm can report back individual findings that
occurred during the processing. Unfortunately, the concept is not explained in
more detail.


3 Privacy preserving techniques for multicenter 
environments 


The following section presents three techniques that can preserve privacy for 
large databases. Therefore they can be used for multicenter environments. 
Finally the three techniques will be evaluated by criteria like accuracy and 
privacy guarantees. 


3.1 Differential privacy 


In 2006 Dwork et al. introduced the notion of ε-Differential Privacy [6]. In
general, Differential Privacy aims to give a record in a statistical database
the same level of privacy as if the record had been removed or had never been in
the database. This means that the data of a single individual needs to be
modified so that the individual cannot be identified. With this approach privacy
can be preserved while still retaining good utility for the processing of the
modified data. The assumption behind Differential Privacy is that the likelihood
of any disclosure changes only by a very small amount regardless of whether the
data is in the database or not. More specifically, the ε in ε-Differential
Privacy describes the privacy loss when a dataset is released from a database.
Therefore a really small ε is desired, but it remains important to keep the
utility of the data. Formally, a randomized algorithm K is ε-differentially
private if the following holds for all datasets $D_1$ and $D_2$ that differ in
at most one element and for every set S of possible outputs of K.

Definition 3.1.1 (ε-Differential Privacy Algorithm).

$$\Pr[K(D_1) \in S] \le e^{\varepsilon} \cdot \Pr[K(D_2) \in S] \qquad (3.1)$$


The consequence of this definition is that no output, and hence none of its
consequences in terms of privacy loss, becomes significantly more or less likely
when a record is removed. This ultimately means that it does not matter whether
data is in the database or not, as long as K fulfills the requirement of
Definition 3.1.1.

With this, strong privacy guarantees can be achieved, but an important factor is
the size of the dataset: the smaller the database, the higher the noise added
(or the smaller ε) has to be to alter/randomise the original data.

Another important question is what a good Differential Privacy algorithm is.
This question cannot be answered in general because it depends on the use case.
If the use case is to process numeric values for statistical operations like
sum, median or average, a good choice is Laplacian noise. This uses the Laplace
mechanism to add noise to the input data. For this algorithm ε is a measure of
the randomization. If ε = 0 the privatized data is completely random noise.
While in theory this obviously provides the best privacy, the data has no real
utility any more, which leads the Differential Privacy approach ad absurdum.
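The Laplace mechanism mentioned above can be illustrated with a minimal sketch
for a noisy sum. The helper names (`laplace_noise`, `private_sum`,
`laplace_pdf`), the clipping bounds and the example values are assumptions made
for illustration and are not taken from the report. The sketch assumes that
neighbouring datasets differ by adding or removing one record, so that a clipped
record can change the sum by at most max(|lower|, |upper|).

```python
import math
import random


def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale).

    The difference of two independent exponential variables with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_sum(values, lower, upper, epsilon, rng):
    """Noisy sum of `values` (hypothetical helper, not from the report)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive; epsilon -> 0 is pure noise")
    # Clip every value so that one record has a bounded influence on the sum.
    clipped = [min(max(v, lower), upper) for v in values]
    # Assumption: adding or removing one record changes the sum by at most this.
    sensitivity = max(abs(lower), abs(upper))
    return sum(clipped) + laplace_noise(sensitivity / epsilon, rng)


def laplace_pdf(x, mean, scale):
    """Laplace density, useful for checking the bound of Definition 3.1.1."""
    return math.exp(-abs(x - mean) / scale) / (2.0 * scale)


rng = random.Random(0)
ages = [34, 51, 47, 29, 62]  # made-up example data
print(private_sum(ages, lower=0, upper=100, epsilon=0.5, rng=rng))
```

A smaller ε increases the noise scale `sensitivity / epsilon`, which mirrors the
trade-off between privacy and utility described in the text.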

Differential Privacy can be divided into two variants. The first is Global
Differential Privacy, where all original data is stored globally. Only the output
computed from this original data is aggregated to fulfill the requirements of
Differential Privacy. For this approach a trusted third party which manages the
data is essential. The other variant is Local Differential Privacy. Here every
individual or data owner modifies the data before it leaves its origin, so that
the original information is stored nowhere else. No trusted third party is
needed because the data is already modified when it reaches another party.
Besides ε-Differential Privacy there also exists (ε,δ)-Differential Privacy.
This version of Differential Privacy accepts deviations by δ from the original
notion in Definition 3.1.1.
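The local variant can be illustrated with randomized response, a classic
mechanism that the report itself does not name; it is used here only as an
assumed example. Each individual reports a yes/no attribute truthfully with
probability e^ε / (1 + e^ε) and flips it otherwise, so the ratio of the two
report probabilities is exactly e^ε, as required by Definition 3.1.1. The
function names and the simulated population are hypothetical.

```python
import math
import random


def randomized_response(truth, epsilon, rng):
    """Report `truth` (a bool) with probability e^eps / (1 + e^eps), else flip it."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return truth if rng.random() < p_truth else not truth


def estimate_rate(reports, epsilon):
    """Debias the observed share of True reports (requires epsilon > 0)."""
    p = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    observed = sum(reports) / len(reports)
    # observed = rate * p + (1 - rate) * (1 - p), solved for rate
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)


rng = random.Random(42)
truths = [i < 3000 for i in range(10000)]  # simulated: 30% carry the attribute
reports = [randomized_response(t, 1.0, rng) for t in truths]
print(estimate_rate(reports, 1.0))
```

The aggregator only ever sees the perturbed reports, which is why no trusted
third party is needed in this setting.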

Differential Privacy is a concept that sounds very promising in theory. While
there are practical use cases (even Apple [5] and Google [8] are using it in their
mobile systems), the real utility depends on the scenario in which it is used.
There is a review paper by Dankar et al. which provides an in-depth look at
medical applications, but its conclusion is that beyond statistical evaluations
the applicability is very limited [4]. However, for a combination of different
techniques Differential Privacy is one of the most promising ones.


3.2 k-Anonymity 


Another technique to preserve privacy is k-Anonymity. The method was
introduced in 2002 by Sweeney [11]. The main principle of k-Anonymity
is to alter the existing data of a database so that it still has utility but it
is guaranteed that affected individuals with data in the database cannot be
re-identified. A collection of datasets can be called k-anonymous if each of the
datasets cannot be distinguished from at least k − 1 other datasets.


Example 3.2.1 (4-Anonymity). A k = 4 anonymized dataset has at least 4 
records for each value combination of certain attributes that k-Anonymity 
applies to. 


There are two methods to achieve k-Anonymity: 


• Suppression: Parts of the data will be removed, disguised or made
indistinguishable (e.g. mapping all data to the same pseudonym).


• Generalization: Modify parts of the data to ranges of values instead of
exact values or assign attributes to a more general type.
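The two methods and the k-anonymity condition can be sketched as follows. The
record layout, the choice of quasi-identifiers (age and postal code) and the
helper names are assumptions made for illustration; the report does not
prescribe them.

```python
from collections import Counter


def generalize_age(age, width=10):
    """Generalization: replace an exact age by a range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def suppress_zip(zip_code, keep=3):
    """Suppression: mask the trailing characters of a postal code."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs at least k times."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


# Made-up example records.
raw = [
    {"age": 34, "zip": "76131", "diagnosis": "A"},
    {"age": 37, "zip": "76133", "diagnosis": "B"},
    {"age": 52, "zip": "76227", "diagnosis": "A"},
    {"age": 58, "zip": "76229", "diagnosis": "C"},
]
anonymized = [
    {**r, "age": generalize_age(r["age"]), "zip": suppress_zip(r["zip"])}
    for r in raw
]
print(is_k_anonymous(raw, ["age", "zip"], k=2))
print(is_k_anonymous(anonymized, ["age", "zip"], k=2))
```

How coarse the ranges and masks have to be depends on the data, which is the
domain knowledge issue discussed in the following paragraph.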


One issue with k-Anonymity is that there is no general measurement for the 
privacy guarantee. Furthermore additional domain knowledge is required for 
suppression or generalization of the data. In some cases there are guidelines 
that could be used for generalization. For example the Canadian Institute for 
Health Research published the "CIHR Best Practices for Protecting Privacy in 
Health Research" which helps to generalize medical data. 

A medical use case for k-Anonymity is described by El Emam et al. [7]. Here the
previously mentioned guidelines from the Canadian Institute for Health Research
are used as background knowledge for an algorithm that generalizes medical
data. With this, the generalization can be performed automatically and it is also
possible to measure the information loss compared to the original data. In this
way the privacy impact on a dataset to which the guidelines apply can be reduced.
They also show the real world feasibility of the approach by using it to hand
over k-anonymous data from pharmacies to commercial data brokers. However, the
issue of a universal generalization remains and every use case has to be
considered individually.


3.3 Secure multi-party computation 


The main principles of Secure multi-party Computation (SMC) were already
introduced by Yao in the 1980s [13]. The basic idea is to evaluate data
from different parties without revealing the data.

According to Lindell and Pinkas, there are two models to achieve this [9]. In
one case there is a trusted third party that evaluates the data for the
participating parties. In the other case there is no third party that one can
trust with the data. In this case direct communication between the parties is
needed and it has to be ensured that the data already leaves the participating
parties in a private state.

The typical scenario for SMC is that there are several parties that own private
data. All parties want to evaluate their data to obtain a common public result.
This can also mean that a third party like a research institute gets this data
to do the evaluation. The main issue in this scenario is that there is no trust
established between the parties or the parties do not want to reveal their data.
A special variation of this scenario exists when there is a third party that
does the data processing and returns the value to the parties. However, for a
medical use case it still remains important that the participating parties do
not get the raw data but only the final result.

A concrete example of such a scenario is calculating the average salary of three
parties. When using secret sharing, the typical procedure is that the starting
party chooses a secret r. This secret is added to its own salary x and the
result is sent to the second party. The second party adds its salary y and sends
the result to party number three, which follows the same procedure. This can
easily be extended to an arbitrary number of parties. Finally, after the round
trip, the first party gets the result back and subtracts r to obtain the sum,
from which the average is calculated, without revealing its salary to the others
or gaining knowledge of the others’ salaries.
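The round trip can be simulated in a few lines. This is only a single-process
sketch of the message flow: the function name `ring_secure_sum`, the salaries
and the use of arithmetic modulo 2^64 (so that the masked values look uniformly
random) are assumptions that go beyond the description above.

```python
import random

MODULUS = 2**64  # assumed: far larger than any realistic sum of salaries


def ring_secure_sum(private_values, rng):
    """Simulate the round trip; returns the sum and the messages on the wire."""
    mask = rng.randrange(MODULUS)  # the secret r chosen by the first party
    running = (mask + private_values[0]) % MODULUS
    messages = [running]  # value sent from party 1 to party 2
    for value in private_values[1:]:
        # Each following party only sees a masked running total.
        running = (running + value) % MODULUS
        messages.append(running)
    total = (running - mask) % MODULUS  # the first party removes r again
    return total, messages


salaries = [52000, 61000, 48000]  # made-up values for x, y and the third party
total, messages = ring_secure_sum(salaries, random.SystemRandom())
print(total / len(salaries))
```

None of the intermediate messages equals a plain salary or partial sum unless
the mask is known, which is the property the example in the text relies on.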

Another approach is to use homomorphic encryption. In this case certain
mathematical operations can be performed on the ciphertext without knowing the
secret key or the need to decrypt it. The available operations depend on the
homomorphic properties of the encryption method. For example, an additive
homomorphic property means that it is possible to calculate
Enc(a) + Enc(b) = Enc(a + b). It needs to be considered that, used as plain
encryption, those methods would have a lot of weaknesses against adversaries,
but the measures are enough to preserve data privacy. A possible scenario for
this would be a third party research algorithm that does a cohort analysis for a
clinician. For this it needs the data from the clinician and from other
participants that provide the comparison data to create the cohort. A main
requirement is that the third party does not see the plain data. To realize
this, a key broker is required which gives a common key to all participants.
With the resulting ciphertexts the third party algorithm can do its cohort
analysis using the homomorphic properties.
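As an illustration of an additive homomorphic property, the following sketch
implements a toy version of the Paillier cryptosystem. Paillier is not named in
the report and is used here only as an assumed example; the primes are far too
small for any real use. Note that in Paillier the operation on ciphertexts is a
multiplication, which corresponds to the addition of the plaintexts.

```python
import math
import random


def keygen(p=1009, q=1013):
    """Toy key generation from two small primes (insecure, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse, needs Python 3.8+
    return n, (lam, mu)


def encrypt(n, m, rng):
    """Enc(m) = (1 + n)^m * r^n mod n^2 for a random r coprime to n."""
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    # (1 + n)^m is congruent to 1 + m * n modulo n^2.
    return ((1 + m * n) * pow(r, n, n * n)) % (n * n)


def decrypt(n, secret, c):
    """Recover m from c with the secret key (lam, mu)."""
    lam, mu = secret
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n


def add_encrypted(n, c1, c2):
    """Homomorphic addition: multiplying ciphertexts adds the plaintexts."""
    return (c1 * c2) % (n * n)


n, secret = keygen()
rng = random.SystemRandom()
c_a = encrypt(n, 120, rng)  # made-up value from one participant
c_b = encrypt(n, 75, rng)   # made-up value from another participant
print(decrypt(n, secret, add_encrypted(n, c_a, c_b)))
```

In the scenario from the text, the third party would only call a function like
`add_encrypted`, while decryption stays with the holder of the secret key.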

An obvious advantage over the previous techniques is the correctness of the
result, which also implies precision. That means that while the results achieved
with Differential Privacy or k-Anonymity can differ to a certain degree from the
real result, SMC always returns the exact result. An issue with SMC is that it
has a big overhead in terms of run time. Even simple operations can take a lot
of time.


3.4 Evaluation of the techniques 


After the introduction of the three different techniques considered in this report,
we will do an evaluation of them that considers the strengths and weaknesses of
the techniques. Table 3.1 gives an overview of this.

In terms of privacy guarantees, both Differential Privacy and k-Anonymity have
metrics that make a statement about the degree of privacy. SMC’s guarantees
depend on the encryption mechanism used and cannot be generalized. Full
accuracy is provided when using SMC, because its privacy preserving mechanism
does not rely on modification of the data. Differential Privacy’s accuracy
is affected by the choice of ε, where a very large ε provides good accuracy
but not much privacy. For k-Anonymity no general statement can be made
because the accuracy depends on the generalisation/suppression method. When
considering scalable performance, Differential Privacy as well as k-Anonymity
should provide good results regardless of the amount of data, while SMC has a
lot of overhead because of the encryption mechanism. Lastly, it is important
whether any kind of trusted party is needed to perform the techniques.
Differential Privacy and k-Anonymity require a party that manages the data. For
Differential Privacy the local approach can be used, so the trusted party is
only needed for the global approach. Only SMC offers the option to operate
completely without a trusted party, if the participants communicate directly
with their ciphertexts.


Table 3.1: Overview of privacy preserving techniques

                      Differential Privacy   Secure multi-party       k-Anonymity
                                             Computation
Privacy guarantees    Metric available       Depends on encryption    Metric available
                                             mechanism
Accuracy              Depends on ε           Exact result             Depends on method
Scalable performance  Good                   High overhead            Good
Trusted party needed  Partly                 No                       Yes
Limitations           Choice of ε affects    Utility and processing   Requires domain
                      properties             time heavily depend on   knowledge
                                             the type of SMC


4 A privacy compliant research interface 


To define a research interface it is important to understand the difference between
a non-interactive interface and an interactive one. A non-interactive research
interface is one where the data is released once and for all and there is no way
to modify the data for a certain request. An interactive research interface can
decide the privacy strategy for each query, since only the data for the given
request is released and the complete data remains hidden behind the interface.

We think that for a privacy preserving research interface it is important not to
follow a one-size-fits-all approach. There are different kinds of queries that
can require different degrees of accuracy. The main goal should always be:
preserve as much privacy as possible and lose as little accuracy as possible.
This can only be achieved with a hybrid approach. On the one hand, this is a
combination of the previously introduced techniques, each used for the range of
queries where it works best. On the other hand, those techniques all cover
specific use cases and can reach their limits, where no more useful query is
possible. Furthermore, there can be requests where both the researcher and the
affected person benefit from data that is not anonymized, for example queries
that can provide feedback on the individual person. For those queries the
person’s consent is mandatory.


To include this in the desired fully automated research interface, a mechanism is
required to map the consent into a digital format. Furthermore, this consent
should be dynamic so that an affected person can grant or revoke it at any time.
In addition, to enable automatic evaluation, an enforcement mechanism is
needed to evaluate consent for each query. Medical consent in a digital format
is a non-trivial task; some concepts exist, but most of them are far from
complete. We will postpone this part, which we call the consent based interface,
to future work.

We assume that the research interface exposes a set of privacy functions like
P_SUM, P_AVERAGE, P_MEDIAN etc. to perform operations on attributes
of the data in the database.


Definition 4.0.1 (Privacy preserving functions). A privacy preserving research
interface defines a set F of privacy preserving functions. They all follow the
naming convention P_*, where * is a mathematical function like
SUM or COUNT.


To perform a query the researcher has to provide additional properties. It needs
to be defined whether accuracy or privacy is desired, and to which degree, or
whether an algorithm wishes to provide additional feedback to an individual.


Definition 4.0.2 (Privacy preserving configuration). A privacy preserving
research interface has a set C = {accuracy, privacy(x), feedback} which
contains the privacy preserving configuration for a request. privacy(x) takes a
number x ∈ ℕ⁺ that indicates the factor of the privacy impact.


With this a request can be formulated. Such a request uses an interface-specific
query language in which the requested attributes from the dataset can be
defined. A privacy function out of F also needs to be used in this query. In
addition, a configuration needs to be provided to indicate what the requirements
for the request are.


Definition 4.0.3 (Privacy preserving request). A request req for a privacy
preserving research interface looks like the following: req = (query, config)
where query is a query made with a query language QL that includes F and
config ∈ C.


With such a request req the interface can now decide, depending on config, which
privacy preserving technique should be used. The following Definition 4.0.4
illustrates this.


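Definition 4.0.4 (Evaluation of config).

$$
\mathit{config} =
\begin{cases}
\text{if } \mathit{accuracy} & \rightarrow \text{use SMC} \\
\text{if } \mathit{privacy}(x) & \rightarrow \text{use Differential Privacy or k-Anonymity depending on } x \\
\text{if } \mathit{feedback} & \rightarrow \text{use consent based interface}
\end{cases}
$$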
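Definitions 4.0.2 to 4.0.4 can be sketched as a small dispatcher. The class
names, the query string and in particular the threshold `DP_THRESHOLD` are
hypothetical: the report only states that the choice between Differential
Privacy and k-Anonymity depends on x and leaves the concrete rule open.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Mode(Enum):
    ACCURACY = "accuracy"
    PRIVACY = "privacy"
    FEEDBACK = "feedback"


@dataclass(frozen=True)
class Config:
    mode: Mode
    x: Optional[int] = None  # privacy factor, only used for Mode.PRIVACY


@dataclass(frozen=True)
class Request:
    query: str  # e.g. "P_AVERAGE(age)" in some query language QL
    config: Config


DP_THRESHOLD = 5  # hypothetical cut-off, not specified in the report


def select_technique(req: Request) -> str:
    """Map the configuration of a request to a privacy preserving technique."""
    cfg = req.config
    if cfg.mode is Mode.ACCURACY:
        return "SMC"
    if cfg.mode is Mode.FEEDBACK:
        return "consent based interface"
    if cfg.x is None or cfg.x < 1:
        raise ValueError("privacy(x) requires a positive integer x")
    # Assumed rule: large x selects Differential Privacy, small x k-Anonymity.
    return "Differential Privacy" if cfg.x >= DP_THRESHOLD else "k-Anonymity"


req = Request("P_AVERAGE(age)", Config(Mode.PRIVACY, x=8))
print(select_technique(req))
```

Keeping the rule for x in a single place makes it easy to replace the assumed
threshold once a suitable privacy metric is available.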


5 Conclusion & outlook 


This technical report looks at three different techniques to preserve the
privacy of an individual’s data. All three techniques have various advantages
and disadvantages. While Differential Privacy and k-Anonymity have good privacy
guarantees, they can lack accuracy. SMC can provide accurate results,
but its performance can be a great uncertainty. So there is certainly no
one-size-fits-all approach. Instead, we propose a hybrid approach that combines
those three techniques and chooses the best one depending on the requirements of
a certain request. In addition, there can be requests where those techniques
cannot help or do not fit the requirement. Therefore a fallback to the
individual’s consent is needed. With this definition of a privacy preserving
research interface for multicenter medical data, the foundation is laid for more
in-depth work and experiments with real world e-Health data.


While this report provides the fundamentals, a real world evaluation still needs
to be done. It needs to be shown that the introduced privacy preserving
techniques work well on real medical data. Another issue that remains is a good
privacy metric. This is especially required for an informed consent decision of
a patient. With this in mind, the consent based interface needs to be introduced
in future work. With this integration a fully featured research interface is
possible, which remains open for further refinement. Finally, this approach
should be evaluated against the GDPR. It has to be determined what is needed to
be compliant with it and what an interface should provide to fulfill the
requirements of the GDPR.

Abstract


There are real world data sets where a linear approximation like the principal 
components might not capture the intrinsic characteristics of the data. Nonlinear 
dimensionality reduction or manifold learning uses a graph-based approach to 
model the local structure of the data. Manifold learning algorithms assume 
that the data resides on a low-dimensional manifold that is embedded in a 
higher-dimensional space. For real world data sets this assumption might not be 
evident. However, using manifold learning for a classification task can reveal a 
better performance than using a corresponding procedure that uses the principal 
components of the data. We show that this is the case for our hyperspectral data 
set using the two manifold learning algorithms Laplacian eigenmaps and locally 
linear embedding. 


1 Introduction 


Nonlinear dimensionality reduction or manifold learning is a useful tool for
high-dimensional data analysis. In contrast to linear dimensionality reduction
as it is performed by a standard principal component analysis (PCA), with
manifold learning the possibly low-dimensional manifold that is embedded in a
high-dimensional space can be uncovered. This so-called manifold assumption
is central to the theory of manifold learning and states that the data resides
on a low-dimensional manifold in high-dimensional space. Manifold learning
has been applied to many computer vision problem domains including face
recognition [3], image retrieval [4] and medical image analysis [1]. Due to the
high spectral resolution of many hyperspectral image data sets and the high
correlation between adjacent and overtone bands, manifold learning has received
some attention in the research community [5].

In this technical report, we will first review the basics of manifold learning, 
why it is a useful framework and how it can be utilized for classification in a 
semi-supervised manner. Finally, we will apply this semi-supervised procedure 
to a hyperspectral data set consisting of four different kinds of wood (chips): 
eucalyptus, poplar, beech and spruce. The results indicate that manifold learning 
outperforms a linear approach using PCA. 


2 Classification with manifold learning 


Discovering the low-dimensional manifold embedded in a higher-dimensional
space can be utilized for classification. We aim to show two aspects of manifold
learning: first, it can be employed for classification; second, manifold
learning outperforms a corresponding linear procedure using principal component
analysis. In general, dimensionality reduction is often used as a step prior to
classification. This is due to the fact that for many datasets, the dimensions
of individual data points might be correlated due to the physical nature of the
process that has generated the data. For instance, in (near) infrared spectroscopy
overtone bands can be observed that are a manifestation of the vibrational modes.
As the resonant frequencies can be approximated by a harmonic oscillator,
characteristic peaks in the spectrum might arise from the vibrational modes
of the same chemical substance. For a classification task, correlation means
that specific dimensions might not carry valuable information, in the sense that
the additional information does not lead to a better separability of the data and
therefore also does not contribute to the classification performance. Removing
correlated dimensions can therefore lead to a simpler classifier with fewer
parameters. When applying manifold learning prior to classification, the
objective is to exploit the manifold assumption. Manifold learning is a good fit
to the data when there are non-linear dependencies between different dimensions.
In practice, it is not evident that non-linear dependencies exist in
high-dimensional data. However, if manifold learning leads to better
classification results than a linear method, this might indicate the presence of
an intrinsic low-dimensional manifold.


3 Laplacian eigenmaps 


We briefly review the basics of one popular manifold learning algorithm
called Laplacian Eigenmaps (LE). Given data samples
$\mathcal{X} = \{x_i\}_{i=1}^{N} \subset \mathbb{R}^n$, LE
computes a Laplacian matrix according to a kernel function. The final mapping
is then defined by the eigenvectors of the graph Laplacian matrix. A detailed
description is given by Algorithm 3.1 below. Central to the algorithm is the
choice of the kernel function. We call a symmetric function
$k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$
a kernel, if the induced Gram matrix defined by $K_{ij} = k(x_i, x_j)$ is
positive semi-definite, i.e.

$$\alpha^\top K \alpha = \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i K_{ij} \alpha_j \ge 0, \qquad (3.1)$$

for all $\alpha \in \mathbb{R}^N$. This is the discrete analog to Mercer’s
condition [6] which states that the function
$K : [a,b] \times [a,b] \to \mathbb{R}$ fulfills the inequality

$$\iint f(x)\, K(x, y)\, f(y)\, dx\, dy \ge 0 \qquad (3.2)$$

for every function $f \in L^2(\mathbb{R})$. A symmetric kernel function
satisfying Mercer’s condition leads to nonnegative real eigenvalues and
orthogonal eigenvectors for the corresponding kernel matrix.
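The positive semi-definiteness condition (3.1) can be checked numerically for a
concrete kernel. The sketch below assumes a Gaussian kernel, which is a common
choice but is not prescribed by the text; the function names and the random
sample points are hypothetical.

```python
import math
import random


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2)) (assumed choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def gram_matrix(points, kernel):
    """Gram matrix K with K[i][j] = k(x_i, x_j)."""
    return [[kernel(p, q) for q in points] for p in points]


def quadratic_form(K, alpha):
    """Evaluate the double sum of alpha_i * K_ij * alpha_j from (3.1)."""
    n = len(alpha)
    return sum(alpha[i] * K[i][j] * alpha[j] for i in range(n) for j in range(n))


rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(6)]
K = gram_matrix(points, gaussian_kernel)
alpha = [rng.uniform(-1.0, 1.0) for _ in points]
print(quadratic_form(K, alpha) >= 0.0)
```

Testing a handful of vectors does not prove positive semi-definiteness, but it
makes the meaning of the quadratic form in (3.1) concrete.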

Algorithm 3.1 Laplacian Eigenmaps


procedure LAPLACIAN EIGENMAPS
    Input: data $\mathcal{X} = \{x_i\}_{i=1}^{N} \subset \mathbb{R}^n$
    Output: embedding $\mathcal{Y} = \{y_i\}_{i=1}^{N} \subset \mathbb{R}^m$
    1.) Build an adjacency graph $G = (V, E)$
        nodes $v_i \in V$ and $v_j \in V$ are connected if $\|x_i - x_j\|_2^2 < \epsilon$
    2.) Pick weights
        Choose a kernel function $k(x_i, x_j)$ and set
        $W_{ij} = k(x_i, x_j)$ if $(i, j) \in E$, and $W_{ij} = 0$ else
    3.) Compute Eigenmap
        $L y = \lambda D y$, with $D_{ii} = \sum_j W_{ij}$ and $L = D - W$
        $x_i \mapsto (y_1(i), \dots, y_m(i))$
end procedure


The embedding is found by computing the generalized eigenvalue problem
involving the graph Laplacian and the corresponding degree matrix. The
nonlinear nature of Algorithm 3.1 is due to the choice of the kernel function.
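The three steps of Algorithm 3.1 can be sketched with NumPy as follows. The
heat kernel exp(-||x_i - x_j||^2 / t), the parameter names `eps` and `t`, and
the toy data are assumptions for illustration. The generalized problem
L y = λ D y is solved through the symmetric matrix D^{-1/2} L D^{-1/2}, and the
trivial constant solution is skipped, which assumes a connected graph.

```python
import numpy as np


def laplacian_eigenmaps(X, m=2, eps=1.0, t=1.0):
    """Embed the rows of X into R^m; returns (eigenvalues, embedding)."""
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # 1.) Adjacency graph: connect i and j if ||x_i - x_j||^2 < eps, no self-loops.
    adjacent = (sq_dist < eps) & ~np.eye(len(X), dtype=bool)
    # 2.) Weights from a heat kernel (assumed kernel choice).
    W = np.where(adjacent, np.exp(-sq_dist / t), 0.0)
    # 3.) Solve L y = lambda D y with L = D - W.
    degree = W.sum(axis=1)
    if np.any(degree == 0):
        raise ValueError("isolated node in the graph: increase eps")
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    L = np.diag(degree) - W
    # Symmetric form D^{-1/2} L D^{-1/2} v = lambda v, with y = D^{-1/2} v.
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L_sym)  # eigenvalues in ascending order
    Y = d_inv_sqrt[:, None] * eigvecs
    # Skip the constant solution with eigenvalue 0 (assumes a connected graph).
    return eigvals[1:m + 1], Y[:, 1:m + 1]


# Toy data: points along a curve in R^3.
s = np.linspace(0.0, 3.0, 30)
X = np.stack([np.cos(s), np.sin(s), s], axis=1)
eigenvalues, Y = laplacian_eigenmaps(X, m=2, eps=0.5)
print(Y.shape)
```

Row i of `Y` holds the coordinates (y_1(i), ..., y_m(i)) of the embedded point
x_i.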


4  Semi-supervised manifold learning 


Semi-supervised machine learning methods make use of unlabeled data points 
for training. Transductive learning is one variant of a semi-supervised learning 
setting where the correct labels of some given unlabeled data points must be 
inferred. This is in contrast to inductive learning where a function is learned that 
maps a data point to its label. Manifold learning algorithms are label-agnostic: 
In order to build the adjacency graph no information about class labels is 
necessary. The main idea behind a semi-supervised manifold learning approach 
is that the kernel matrix is built using labeled and unlabeled data points. The 
resulting matrix quantifies the similarity between all pairs of data points. As a 
subset of these data points is labeled, the kernel matrix relates each unlabeled 
data point to every labeled data point. The computation of the eigenmap and 
the projection of the high-dimensional data leads to an embedded space with 
partially labeled data points. Unlabeled points can be classified with a simple 
nearest-neighbor search. In this way, the intrinsic manifold structure, provided 
that it exists, is put to use for a classification task. 


Algorithm 4.1 Semi-Supervised Manifold Learning 


procedure SEMI-SUPERVISED MANIFOLD LEARNING 
    Input: labeled data $\{(x_1, c_1), \dots, (x_p, c_p)\}$, 
           unlabeled data $\{x_1^u, \dots, x_q^u\}$ 
    Output: labels for $\{x_1^u, \dots, x_q^u\}$ 
    1.) Compute embedding by manifold learning algorithm 
        e.g. by $L y = \lambda D y$ 
    2.) Embed all data points 
        $x_i \mapsto (y_1(i), \dots, y_m(i))$ 
    3.) Classify unlabeled data points 
        for all unlabeled data points $x^u$ do 
            get the labels of the k nearest labeled points in the embedded space 
            assign data point $x^u$ the most common label 
        end for 
end procedure 
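Step 3 of Algorithm 4.1 amounts to a k-nearest-neighbor vote in the embedded space. A minimal sketch follows, assuming that the embedded coordinates of labeled and unlabeled points have already been computed jointly by some manifold learning algorithm; the function name and the toy coordinates are hypothetical, and only the wood type names are taken from the report.

```python
import numpy as np
from collections import Counter

def classify_in_embedding(Y_labeled, labels, Y_unlabeled, k=1):
    """Assign each unlabeled embedded point the most common label
    among its k nearest labeled points (Euclidean distance)."""
    Y_labeled = np.asarray(Y_labeled, dtype=float)
    labels = np.asarray(labels)
    predictions = []
    for y in np.asarray(Y_unlabeled, dtype=float):
        distances = np.linalg.norm(Y_labeled - y, axis=1)
        nearest = np.argsort(distances)[:k]
        votes = Counter(labels[nearest].tolist())
        predictions.append(votes.most_common(1)[0][0])
    return predictions

# Toy 2D embedding with two well separated classes (illustrative values only).
Y_labeled = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0]]
labels = ["beech", "beech", "spruce", "spruce"]
Y_unlabeled = [[0.05, 0.1], [0.9, 1.0]]

predicted_k1 = classify_in_embedding(Y_labeled, labels, Y_unlabeled, k=1)
predicted_k3 = classify_in_embedding(Y_labeled, labels, Y_unlabeled, k=3)
```

Because the embedding is computed from labeled and unlabeled points together, the procedure is transductive: a new point requires recomputing the embedding unless an out-of-sample extension is added.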


The procedure described above can be used together with any manifold learning 
algorithm. To provide a comparison for LE, we also apply a second manifold 
learning algorithm, locally linear embedding (LLE), to the data set. For a given 
data set $\mathcal{X} = \{x_i\}_{i=1}^{N} \subset \mathbb{R}^n$, LLE tries to reconstruct every data point from a 
linear combination of its k-nearest neighbors. LLE minimizes the following 
cost function: 


$$E(W) = \sum_{i=1}^{N} \Big\| x_i - \sum_{x_j \in N_k(x_i),\, j \neq i} w_{ij} x_j \Big\|^2 \quad \text{s.t.} \quad \sum_{j=1}^{N} w_{ij} = 1 \quad \forall i \in \{1, \dots, N\}. \qquad (4.1)$$


Figure 4.1: Spectra of the four different woods: eucalyptus, poplar, beech and spruce. This plot 
also shows the standard deviation ($0.1\sigma$) around the mean. 


$N_k(x)$ denotes the set of k-nearest neighbors of $x$. In order to achieve a 
neighborhood preserving map, the weight matrix resulting from the optimization 
problem (4.1) above is used to find an embedding by minimizing: 


$$E(Y) = \sum_{i=1}^{N} \Big\| y_i - \sum_{y_j \in N_k(y_i),\, j \neq i} w_{ij} y_j \Big\|^2. \qquad (4.2)$$
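The reconstruction weights of Eq. (4.1) can be computed point by point by solving a small linear system. The sketch below shows one common way to do this and is not taken from the report: the regularization term `reg` is an added assumption that keeps the local Gram matrix invertible, and the collinear test data are purely illustrative.

```python
import numpy as np

def lle_weights(X, k=5, reg=1e-3):
    """Weights minimizing E(W) of Eq. (4.1); each row of W sums to one."""
    n_samples = X.shape[0]
    W = np.zeros((n_samples, n_samples))
    for i in range(n_samples):
        distances = np.linalg.norm(X - X[i], axis=1)
        distances[i] = np.inf                  # exclude the point itself (j != i)
        neighbors = np.argsort(distances)[:k]
        Z = X[neighbors] - X[i]                # neighbors centered on x_i
        C = Z @ Z.T                            # local Gram matrix (k x k)
        # Regularization (assumption): C is singular when k exceeds the
        # dimension of the space spanned by the neighbors.
        C = C + reg * max(np.trace(C), 1e-12) * np.eye(k)
        w = np.linalg.solve(C, np.ones(k))
        W[i, neighbors] = w / w.sum()          # enforce the sum-to-one constraint
    return W

# Hypothetical data: ten equally spaced points on a straight line in R^2.
line = np.outer(np.arange(10.0), np.array([1.0, 2.0]))
W = lle_weights(line, k=2)
```

For an interior point of the line, the two neighbors receive equal weights, so the point is reconstructed exactly. The embedding of Eq. (4.2) is then obtained from the eigenvectors of $(I - W)^T (I - W)$, which is not shown here.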


In the following, we describe the methodology that was used to apply and 
validate Algorithm 4.1 for hyperspectral data. The hyperspectral images were 
acquired using a Specim SWIR camera with a spectral range from 950 nm to 2500 
nm and a spectral resolution of 10 nm. Figure 4.1 shows the entirety of the 
spectra for the four classes in terms of a mean spectrum with $0.1\sigma$. Separate 
image sets were acquired for training and testing. To evaluate Algorithm 4.1, a 
target dimension of 2 was chosen for all dimensionality reduction procedures. 


[Flowchart: for each cross-validation iteration, draw a random subsample of 100 labeled spectra 
and a random subsample of 50 unlabeled spectra (for every wood type); embed the spectra via 
manifold learning; classify the unlabeled points with k-NN in the embedded space.] 


Figure 4.2: The proposed methodology. We acquired separate data sets for training and testing. 
For each cross-validation iteration, we sampled 100 labeled and 50 unlabeled spectra from every 
wood type. No further preprocessing of the spectra is applied. Based on this data, the Laplacian 
(and the locally linear embedding optimization problem) is computed. The images above of the fine 
wood chips are averages over all hyperspectral bands. 


5 Results 


The above methodology leads to the results given in Table 5.1. The results 
indicate that the manifold learning algorithms used here outperform linear 
dimensionality reduction in terms of a 1-nearest neighbor classification in the 
embedded space. Furthermore, we used two different kernel functions $k_{rbf}$ and 
$k_{cos}$. The overall accuracy is higher for $k_{cos}$. As the spectra were not 
preprocessed, this result is not too surprising, because the cosine similarity is 
invariant to a scaling of the spectrum, which is in contrast to the RBF kernel. 
We furthermore observe that LE outperforms LLE for our data set. 


Table 5.1: Classification results for PCA and the different manifold learning algorithms using 
the semi-supervised manifold learning procedure outlined in Algorithm 4.1. The overall accuracy 
(OA) is given in the last column. 


Method              Eucalyptus   Poplar   Beech   Spruce   OA ($\mu \pm \sigma$) 

PCA                 0.61         0.74     0.76    0.58     0.67 ± 0.036 
$LLE_{k=30}$        0.68         0.81     0.76    0.60     0.71 ± 0.019 
$LLE_{k=40}$        0.72         0.83     0.76    0.63     0.74 ± 0.018 
$LE_{rbf}$          0.75         0.86     0.76    0.61     0.74 ± 0.020 
$LE_{cos}$          0.76         0.95     0.75    0.66     0.78 ± 0.016 


In particular, $LE_{cos}$ significantly outperforms the PCA-based approach. In addition, 
as indicated by the standard deviation, $LE_{cos}$ is the most robust method, while 
the variance of the PCA-based procedure throughout the cross-validation is the 
highest. 


6 Conclusion & Outlook 


In essence, Laplacian eigenmaps and locally linear embedding build a discrete 
approximation of the underlying data manifold. By computing a weight matrix 
that captures the local structure of the data, the intrinsic characteristics are 
utilized for dimensionality reduction. The induced neighborhood preserving map 
is a suitable tool for high-dimensional data analysis. We have applied manifold 
learning for a semi-supervised classification task and showed that it outperforms 
classification in the space that is defined by the principal components. Our 
results indicate that choosing a kernel function is a critical step for LE. Manifold 
learning has the potential to uncover the low-dimensional manifold of the data. 
Future work should continue to examine this potential. 


Abstract 


Ellipsometry is an optical method used for characterizing materials and thin 
films. The principle of ellipsometry is that it measures polarization changes at a 
sample in a reflection or transmission configuration. However, the shape of the 
sample is limited to flat or nearly flat surfaces because ellipsometry is sensitive 
to the angle of incidence, the tilt angle and the sample position (height). Even a 
slight misalignment of the sample might lead to significant experimental errors. 
For large misalignments, the detector of the ellipsometer no longer receives a 
sufficient signal. There have been a few approaches for characterizing nonplanar 
surfaces by ellipsometry. This report gives an overview of these approaches for 
ellipsometric measurements of nonplanar surfaces. 


1 Introduction 


Ellipsometry is an optical technique for the characterization of materials and thin 
films. The main features of ellipsometry are high precision (thicknesses from 
a few Å to several tens of microns), nondestructive measurement, and a wide 
range of applications. The principle of ellipsometry is that it measures polarization 
changes at a sample in a reflection or transmission configuration. Fig. 1.1 
shows the principle of reflection ellipsometry. The incident light is linearly 
polarized. After reflection from the substrate, the reflected light becomes 
elliptically polarized. The Fresnel equations describe the interaction of light 
(electromagnetic waves) and materials. The polarization changes can be defined 
as the ratio $\rho$ of the amplitude reflection coefficients for p- and s-polarizations 
[1]: 

$$\rho = \frac{r_p}{r_s} = \tan\Psi \, e^{i\Delta}, \qquad (1.1)$$


where $\Psi$ and $\Delta$ represent the amplitude ratio and the phase difference. Ellipsometry 
can be applied in many scientific and industrial fields, e.g., semiconductors, 
chemistry, the display industry and biomaterials [7]. 
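To illustrate Eq. (1.1), the following sketch evaluates the Fresnel reflection coefficients of a bare substrate and converts their ratio into $\Psi$ and $\Delta$. It is an illustration under assumptions, not a method from the report: it uses one common sign convention for $r_p$ (with which $\rho = -1$ at normal incidence), and the function name and the refractive index of 1.5 are hypothetical choices.

```python
import cmath
import math

def psi_delta_bare_substrate(n_substrate, aoi_deg, n_ambient=1.0):
    """Return (psi_deg, delta_deg) from rho = r_p / r_s = tan(psi) * exp(i*delta)."""
    theta_i = math.radians(aoi_deg)
    cos_i = math.cos(theta_i)
    # Snell's law; cmath also covers absorbing substrates with a complex index.
    sin_t = n_ambient * math.sin(theta_i) / n_substrate
    cos_t = cmath.sqrt(1.0 - sin_t * sin_t)
    # Fresnel amplitude reflection coefficients (sign convention is an assumption).
    r_p = (n_substrate * cos_i - n_ambient * cos_t) / (n_substrate * cos_i + n_ambient * cos_t)
    r_s = (n_ambient * cos_i - n_substrate * cos_t) / (n_ambient * cos_i + n_substrate * cos_t)
    rho = r_p / r_s
    psi_deg = math.degrees(math.atan(abs(rho)))
    delta_deg = math.degrees(cmath.phase(rho))
    return psi_deg, delta_deg

# Hypothetical glass-like substrate (n = 1.5) in air.
psi_70, delta_70 = psi_delta_bare_substrate(1.5, 70.0)
brewster_deg = math.degrees(math.atan(1.5))
psi_brewster, _ = psi_delta_bare_substrate(1.5, brewster_deg)
psi_normal, delta_normal = psi_delta_bare_substrate(1.5, 0.0)
```

At the Brewster angle $r_p$ vanishes, so $\Psi$ drops to zero. This strong dependence on the angle of incidence is the reason why sample alignment matters so much in ellipsometry.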


Figure 1.1: Measurement principle of ellipsometry. Labels in the figure: incident light and 
reflected light (each with field components $E_x$ and $E_y$), air, substrate. 


In conventional ellipsometers, samples are limited to a planar shape because 
ellipsometry is sensitive to the angle of incidence (AOI), the tilt angle and the sample 
position (height). Even a slight misalignment of the position and orientation of the 
sample might lead to significant experimental errors. For large misalignments, 
the detector of the ellipsometer no longer receives a sufficient signal. For 
nonplanar surfaces, the beam path of the reflected or transmitted light is changed 
by the surface shape. In order to solve this problem, different methods 
have been proposed for nonplanar surfaces. In this paper, we review and 
compare these approaches in the configuration of reflection ellipsometry. 


2 Surface orientation in reflection ellipsometry 


Figure 2.1: Definition of the surface orientation. (a) An offset $h$ along the surface normal $\hat{n}$. (b) 
The surface rotates around the y-axis. 


Fig. 2.1(a) shows a planar surface defining the xy-plane. The z-axis $(0, 0, 1)$ is 
the surface normal $\hat{n}$. The hat denotes a unit vector. If the incident beam $\hat{i}$ 
lies in the yz-plane and the incident angle is $\theta$, the incident beam is expressed 
as $(0, \sin\theta, -\cos\theta)$. The reflected beam $\hat{r}$ can be defined as $(0, \sin\theta, \cos\theta)$. 
The relationship between $\hat{n}$, $\hat{i}$ and $\hat{r}$ is given by [11]: 


$$\hat{r} = \hat{i} - 2(\hat{i} \cdot \hat{n})\hat{n}. \qquad (2.1)$$

The angle of incidence $\theta$ is determined by the surface normal $\hat{n}$ and the incident 
beam $\hat{i}$ as: 

$$\theta = \cos^{-1}(-\hat{i} \cdot \hat{n}). \qquad (2.2)$$

If the surface has an offset $h$ along the surface normal $\hat{n}$ as shown in Fig. 2.1(a), 
it will cause an offset $d_h$ of the reflected beam: 

$$d_h = 2h\sin\theta. \qquad (2.3)$$

For an incident angle of 70° and an offset of 1 mm, the offset $d_h$ is about 1.88 
mm. 
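The offset of Eq. (2.3) can be cross-checked with a small ray construction. The sketch below is illustrative only; the function names are hypothetical, and the incident ray is assumed to be aimed at the origin of the unshifted surface.

```python
import numpy as np

def beam_offset_closed_form(h, aoi_deg):
    """Eq. (2.3): offset of the reflected beam for a surface raised by h."""
    return 2.0 * h * np.sin(np.radians(aoi_deg))

def beam_offset_geometric(h, aoi_deg):
    """Same offset from geometry: intersect the incident ray with the plane
    z = h and measure the distance between the two parallel reflected rays."""
    theta = np.radians(aoi_deg)
    i_hat = np.array([0.0, np.sin(theta), -np.cos(theta)])  # incident direction
    r_hat = np.array([0.0, np.sin(theta), np.cos(theta)])   # reflected direction
    t_hit = h / i_hat[2]               # ray p(t) = t * i_hat reaches z = h here
    hit_point = t_hit * i_hat          # reflection point on the shifted surface
    # Distance of hit_point from the original reflected ray through the origin.
    return float(np.linalg.norm(np.cross(hit_point, r_hat)))

offset_formula_mm = float(beam_offset_closed_form(1.0, 70.0))
offset_geometry_mm = beam_offset_geometric(1.0, 70.0)
```

Both routes give about 1.88 mm for a 1 mm height offset at 70° incidence, in line with the value quoted above.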


Fig. 2.1(b) illustrates a surface rotated by an angle $\phi$ around the y-axis. The surface 
normal of the sample becomes $\hat{n}' = (\sin\phi, 0, \cos\phi)$. Using Eqs. (2.1) and (2.2), we can 
easily compute the angle of incidence $\theta'$ and the reflected beam $\hat{r}'$ after the rotation as: 


$$\cos\theta' = \cos\theta \cos\phi, \qquad (2.4)$$
$$\hat{r}' = (\cos\theta \sin 2\phi, \; \sin\theta, \; \cos\theta \cos 2\phi). \qquad (2.5)$$


The included angle $\theta_r$ between the original reflected beam $\hat{r}$ and the reflected 
beam $\hat{r}'$ after tilting can be calculated from the dot product of the two vectors: 


$$\cos\theta_r = \sin^2\theta + \cos^2\theta \cos 2\phi. \qquad (2.6)$$


For an incident angle of 70°, a surface tilt of 5° around the y-axis produces 
an angle deviation of about 3.4° at the detector. If the distance between the surface 
and the detector is 200 mm, this induces an offset of 11.9 mm. 
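The numbers of this tilt example follow directly from the reflection law (2.1). The sketch below recomputes them; it assumes that the detector plane is perpendicular to the undisturbed reflected beam, so that the offset is the distance times the tangent of the deviation angle, and the function names are illustrative.

```python
import numpy as np

def reflect(i_hat, n_hat):
    """Eq. (2.1): reflected direction for incident direction i_hat and normal n_hat."""
    return i_hat - 2.0 * np.dot(i_hat, n_hat) * n_hat

def tilt_deviation(aoi_deg, tilt_deg, detector_distance_mm):
    """Deviation angle (deg) of the reflected beam and resulting detector offset (mm)."""
    theta, phi = np.radians(aoi_deg), np.radians(tilt_deg)
    i_hat = np.array([0.0, np.sin(theta), -np.cos(theta)])
    r_flat = reflect(i_hat, np.array([0.0, 0.0, 1.0]))
    r_tilt = reflect(i_hat, np.array([np.sin(phi), 0.0, np.cos(phi)]))
    theta_r = np.arccos(np.clip(np.dot(r_flat, r_tilt), -1.0, 1.0))
    # Assumption: detector plane perpendicular to the undisturbed beam.
    offset_mm = detector_distance_mm * np.tan(theta_r)
    return float(np.degrees(theta_r)), float(offset_mm)

deviation_deg, offset_mm = tilt_deviation(70.0, 5.0, 200.0)

# Closed form of Eq. (2.6) for comparison.
theta, phi = np.radians(70.0), np.radians(5.0)
cos_theta_r = np.sin(theta) ** 2 + np.cos(theta) ** 2 * np.cos(2.0 * phi)
```

The vector computation and Eq. (2.6) agree and give a deviation of roughly 3.4° and an offset of roughly 11.9 mm.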


From the above calculation results, surface offset and tilt produce a significant 
offset for the detector, which will degrade the measurement accuracy. Therefore, 
special optical designs, compensation methods and precision alignment are 
necessary for ellipsometric measurements of nonplanar surfaces. 


3 Ellipsometric measurements for nonplanar surfaces 


There have been a few approaches for characterizing nonplanar surfaces by 
ellipsometry. These approaches can be categorized into three types: combination 
of topometry and ellipsometry, polarization model for azimuth deviations, and 
return-path ellipsometry with special reflectors. In this section, the basic 
principles and the main features of these approaches will be introduced. 


3.1 Combination of topometry and ellipsometry 


In order to simultaneously determine the topometry and the optical constants of 
surfaces, combinations of ellipsometry and topometric measurements have been 
proposed, e.g., laser interferometry [16], microscopic fringe projection and 
white light interferometry [14]. A microscope objective with a high numerical 
aperture (NA) is used to collect the reflected light, as shown in Fig. 3.1. 
The topometric measurements yield heights relative to a plane of 
reference, and the ellipsometric measurements yield the optical constants or 
film thicknesses. The common feature of these configurations is the off-axis 
focusing method, which provides tilted irradiation of the surface and high lateral 
resolution, and collects the reflected light from the nonplanar surface. High-NA 
microscope objectives can measure steep inclinations of surfaces. However, 
the working distance is short, e.g., an objective with an NA of 0.8 has a working 
distance of 1 mm. 


Figure 3.1: Internal focusing and off-axis illumination with a tilted sample. Labels in the figure: 
off-axis illumination, optical axis, objective, sample. 


In contrast to the internal focusing, Wirth proposed micro-deflection-ellipsometry 
to combine topometric and ellipsometric measurements, which 
is shown in Fig. 3.2. He used a lens system to collect the reflected light for 
the polarization state generator (PSG), and a beamsplitter to split the reflected 
light between a position sensitive detector (PSD) and an ellipsometric detector. The 
PSD determines the surface orientation, and the ellipsometric parameters 
are obtained by the ellipsometric detector. In order to receive evaluable 
signals from the curved surface, the first lens should have a large 
aperture. Therefore, Fresnel lenses are used in the optical system. Compared to 
the internal focusing, this configuration has a larger working distance of 100 
mm. 


3.2 Polarization model for azimuth deviations 


Lee and Chao found that the azimuth deviation of the polarizer is the same as 
the deviation of the surface normal in a calibrated rotating-analyzer ellipsometer. 
The relationship can be described by Mueller matrices [1]: 


$$M_{meas} = M_A \cdot R(A) \cdot M_{sample} \cdot R(-P) \cdot M_P, \qquad (3.1)$$


where $M_{meas}$ is the measured Mueller matrix, $M_{sample}$, $M_A$ and $M_P$ are the 
Mueller matrices of the nondepolarizing isotropic sample, the analyzer and the 
polarizer, and $R$ is the rotation matrix. $P$ and $A$ 
represent the rotation angles of the polarizer and the analyzer. They used Eq. (3.1) 
and a three-intensity imaging ellipsometer to measure the surface topography 
and the coating thickness of a lens. 


Figure 3.2: Combination of topometry (PSD) and ellipsometry (adapted from Wirth). Labels in 
the figure: detector, PSD, beamsplitter, sample. 


Neuschaefer-Rube and Holzapfel proposed a method to measure surface 
geometry and material distribution. They used the internal focusing to collect 
the reflected beam from the curved surface, which is introduced in section 3.1. 
Despite the similar configuration, the surface inclinations can be determined 
directly by the polarization model without other topometric measurement 
methods. The polarization model is expressed as: 


$$M_{meas} = R(\theta_{out}) \cdot R(-\phi) \cdot M_{sample} \cdot R(\phi) \cdot R(\theta_{in}), \qquad (3.2)$$


where $\theta_{in}$, $\theta_{out}$ and $\phi$ are the azimuthal rotation angles on the principal plane 
of the focusing optics. The angle of incidence and the surface orientation can be 
obtained by the eight-zone-measurement algorithm. After scanning the 
whole surface, the profile can be reconstructed from the surface inclination 
(gradient data). 


Johs and He used a return-path ellipsometer to measure samples that 
exhibit a wobble effect. The configuration of return-path ellipsometry is 
shown in Fig. 3.3. They established a Mueller matrix model to describe the 
measurement system. The model is given by: 


$$M_{meas} = R(rec) \cdot M_{sample} \cdot M_{mirror} \cdot M_{sample} \cdot R(src), \qquad (3.3)$$


where $rec$ and $src$ are the rotation angles of the receiver and the source. They 
compensated a ±0.8° substrate wobble and reduced the signal variation to less 
than 2%. 


Li et al. considered the effect of the incident plane deviation and proposed 
a Mueller matrix model to describe the Mueller matrix of the tilted surface as: 


$$M_{meas} = R(-\alpha) \cdot M_{sample} \cdot R(\alpha). \qquad (3.4)$$

They assumed that the two rotation angles have the same absolute value but opposite signs. 
By using Mueller matrix ellipsometry, they successfully measured the oxide 
layer thickness and the curvature radius for a spherical lens. 
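The structure of the tilt model (3.4) can be made concrete with explicit 4x4 matrices. The sketch below is an illustration under assumptions and not the implementation of Li et al.: it uses one widely used convention for the Mueller rotation matrix and for the normalized Mueller matrix of a nondepolarizing isotropic sample in terms of $\Psi$ and $\Delta$; sign conventions differ between textbooks, and all function names and numerical values are hypothetical.

```python
import numpy as np

def mueller_rotation(angle_rad):
    """Mueller rotation matrix R(angle) (assumed sign convention)."""
    c, s = np.cos(2.0 * angle_rad), np.sin(2.0 * angle_rad)
    return np.array([[1.0, 0.0, 0.0, 0.0],
                     [0.0, c, s, 0.0],
                     [0.0, -s, c, 0.0],
                     [0.0, 0.0, 0.0, 1.0]])

def mueller_isotropic_sample(psi_rad, delta_rad):
    """Normalized Mueller matrix of a nondepolarizing isotropic sample."""
    n = np.cos(2.0 * psi_rad)
    c = np.sin(2.0 * psi_rad) * np.cos(delta_rad)
    s = np.sin(2.0 * psi_rad) * np.sin(delta_rad)
    return np.array([[1.0, -n, 0.0, 0.0],
                     [-n, 1.0, 0.0, 0.0],
                     [0.0, 0.0, c, s],
                     [0.0, 0.0, -s, c]])

def measured_mueller_tilted(psi_rad, delta_rad, alpha_rad):
    """Eq. (3.4): M_meas = R(-alpha) * M_sample * R(alpha)."""
    sample = mueller_isotropic_sample(psi_rad, delta_rad)
    return mueller_rotation(-alpha_rad) @ sample @ mueller_rotation(alpha_rad)

psi, delta = np.radians(30.0), np.radians(120.0)    # illustrative sample values
M_flat = measured_mueller_tilted(psi, delta, 0.0)
M_tilt = measured_mueller_tilted(psi, delta, np.radians(5.0))
```

For an untilted isotropic sample the two off-diagonal 2x2 blocks vanish. A nonzero $\alpha$ fills them, for example `M_tilt[0, 2]` equals $-\cos 2\Psi \sin 2\alpha$ in this convention, which is the kind of signature a fit of the measured matrix can use to recover the tilt.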


Duwe et al. [6] modified the Mueller matrix model of Li et al. because of a 
significant mismatch at larger tilt angles. The modified model is described as: 


$$M_{meas} = R(-\delta) \cdot M_{sample} \cdot R(\gamma), \qquad (3.5)$$


where $\delta$ and $\gamma$ are rotation angles of the Mueller rotation matrix. In contrast 
to the model of Li et al., they assumed that the two rotation angles have different 
signs and values. They used a spectroscopic imaging ellipsometer to measure a 
single-layer coating on a microlens. 


3.3 Return-path ellipsometry 


In return-path ellipsometry, the light beam reflected from the surface is reflected 
back to the same position on the surface by a mirror [20][2]. Fig. 3.3 illustrates 
the schematic of return-path ellipsometry. The advantages of this configuration 
are its simple construction, its suitability for process monitoring, and a higher 
sensitivity to the optical properties of surfaces than conventional ellipsometers. 
Please refer to [3] for more details. 


In most semiconductor processes, samples usually need to rotate to obtain uniform 
layers, e.g., in plasma-enhanced chemical vapor deposition and epitaxial growth 
processes. The rotation of samples inevitably produces a wobble effect because 
the rotation axis and the surface normal of the sample are not parallel. As 
mentioned in section 2, ellipsometry is very sensitive to the angle of incidence 
and the sample position. In order to obtain accurate measurements, Haberland 
et al. [9] used return-path ellipsometry and replaced the plane mirror with a 
spherical mirror. In geometrical ray tracing, every ray that passes through the 
center of curvature of the spherical mirror is reflected back along its original path. 
This configuration can effectively reduce the error from the angle deviation for 
sample rotation and sample wobbling during the manufacturing process. 


Figure 3.3: Configuration of return-path ellipsometry, where PSG, PSA and NPBS are the polarization 
state generator, the polarization state analyzer and the non-polarizing beamsplitter, respectively. 


In order to solve the alignment problem between the sample and the detector, 
Hartrumpf and Negara developed a laser scanner that overcomes this limitation 
by means of a retroreflector (retroreflective sheet). The principle is based on 
return-path ellipsometry, which is shown in Fig. 3.3, with the retroreflector acting 
as the reflector. A retroreflector can return the light beam from the sample back 
along the same beam path with only a phase difference of 180°. In other words, 
its polarization effect is the same as that of an ideal mirror. In this configuration, 
the alignment condition for the sample and the detector is fulfilled for angle 
deviations of up to 30°. Chen et al. used this concept to develop an ellipsometer 
and measured the ellipsometric parameters and the refractive index for nonplanar 
surfaces [4][5]. 


4 Discussion and comparison 


For ellipsometric measurements of nonplanar surfaces, the combination of topometry 
and ellipsometry is a straightforward method. This method can achieve a very high 
lateral resolution (about 2.8 µm in [19]). However, the hardware for topometry 
increases the complexity of the whole system, especially for the system 
alignment and the calibration. In addition, these methods use a focused beam. 
In order to acquire accurate results, correct focusing planes of the measurement 
beam are important, so autofocusing methods are applied in these approaches. 
For the measurement of the whole surface, vertical and xy scanning for every 
point are necessary, which is very time-consuming. 


Polarization models for azimuth deviations provide another solution for surface 
geometry. This method can be easily applied to conventional ellipsometers 
without extra hardware. Nonetheless, the range of topometric measurements is 
small because the polarization characteristic of the analyzer 
(waveplate) is sensitive to the incident angle [8]. Waveplates are made 
of birefringent materials and designed for normal incidence. Thus, the 
retardance of a waveplate changes when the incidence is not normal. 
Large incident angles at the waveplate induce significant errors of the 
retardance. On the other hand, if the sample is tilted, according to the calculation 
in section 2, the beam offset at the detector is large. Adjusting the position of 
the detector is then necessary and also time-consuming. 


Return-path ellipsometry has a high sensitivity to the optical properties of materials 
due to the double reflection from the sample. Special reflectors (spherical 
mirror and retroreflector) enable ellipsometric measurements of nonplanar 
surfaces. However, the disadvantages of this configuration are the need for a high 
power light source and the polarization distortion induced by the non-polarizing 
beamsplitter. The non-polarizing beamsplitter loses a large amount of the power of 
the light source (more than 75%). Moreover, the non-polarizing beamsplitter is 
not an ideal component in polarization optics [18][22]. Therefore, a calibration 
of the beamsplitter is necessary. 


5 Summary 


In this report, we have introduced different approaches for ellipsometric 
measurements of nonplanar surfaces, including their principles and main features. 
Each approach has its own advantages, disadvantages and suitable application 
fields. Conventional ellipsometers can only measure samples with flat or nearly 
flat surfaces. However, there is an urgent need in the market for ellipsometric 
measurements of nonplanar surfaces, e.g., lens coatings and varnish layers on 
metallic objects. Further research should be conducted in theory and hardware 
development to meet the needs of industry. 


Abstract 


Understanding and interpreting a scene is a key task of environment perception 
for autonomous driving, which is why autonomous vehicles are equipped with a 
wide range of different sensors. Semantic segmentation of sensor data provides 
valuable information for this task and is often seen as a key enabler. In this report, 
we present a deep learning approach for 3D semantic segmentation of 
lidar point clouds. The proposed architecture uses the lidar's native range view 
and additionally exploits camera features to increase accuracy and robustness. 
Lidar and camera feature maps of different scales are fused iteratively inside 
the network architecture. We evaluate our deep fusion approach on a large 
benchmark dataset and demonstrate its benefits compared to other state-of-the-art 
approaches, which rely only on lidar. 


1 Introduction 


One of the key challenges of autonomous driving is the understanding of the 
vehicle's environment. Therefore, autonomous vehicles are equipped with a wide 
range of sensor modalities, usually including camera, lidar, radar and ultrasonic 
sensors. With different complementary sensors available, shortcomings of an 
individual sensor type can be compensated by other sensor types, increasing 
accuracy and robustness. In this work, we focus on camera and lidar sensors. 
Understanding and interpreting a scene is a key task of environment perception 
for autonomous driving, which makes semantic segmentation of sensor data 
valuable. For camera images, assigning a class label to every image pixel has 
been addressed very successfully with Convolutional Neural Networks (CNNs) 
over the past years, achieving impressive results on road and urban scenes [5]. 
When dealing with 3D lidar point clouds, however, the first challenge is a proper 
representation that enables the application of CNNs. One possibility is the lidar's 
native range view, which has shown promising results [15][16]. This allows the 
application of established image segmentation architectures. 

Having different sensors available with an overlapping field of view allows for 
approaches that fuse the data of different sensors to improve the robustness and 
overall accuracy. When addressing the fusion of camera and lidar data, some 
challenges arise. One is a substantial difference in their resolution and another is 
their considerable difference in measurements and sensor space. While a camera 
observes brightness values resulting in an image, a lidar measures the distance 
to its environment, generating a sparse 3D point cloud. Additionally, different 
fusion strategies must be considered. Following [4], these are the fusion of the 
sensor data (early fusion), the fusion of the predictions for lidar and camera 
data (late fusion) or the fusion of the feature maps inside a CNN (deep fusion). 
In this work, we propose a deep fusion approach, applied to the range view 
representation, which makes use of camera and lidar data to calculate a semantic 
segmentation of lidar point clouds. The contributions of this work are twofold: 


• First, we propose a fusion module, which takes camera and lidar features, 
transforms them into a common space and fuses them afterwards. 


• Second, we propose a fusion architecture building upon the fusion modules 
and apply them iteratively throughout our network, following the idea of 
iterative deep aggregation [26]. This way, we are able to fuse aggregated 
features of both sensors at different scales and maximize the fused 
information. 


2 Related work 


2.1 2D Semantic segmentation 


The success of deep learning applied to scene parsing and semantic segmentation 
is closely related to its success in classical image classification [7]. 
One widely used approach is Fully Convolutional Neural Networks 
(FCNN) [13], which calculate a pixel-wise prediction for a given image in 
an end-to-end fashion. The authors replaced the fully connected layers of common 
classification architectures with 1x1-convolutions, thereby replacing the original 
image classification with a pixel-wise classification. 

One main challenge that recent works have focused on is the loss of spatial resolution 
while aggregating information. It is of great importance to capture the global 
context of a scene as well as fine local structures. DeepLabv3 addresses 
this by 'atrous' convolutions, which increase the size of the receptive fields 
without reducing resolution or increasing filter sizes. 'Atrous' convolutions with 
different rates are employed in parallel to exploit context at different scales. 
In [26], an aggregation architecture is presented, which the authors call deep 
layer aggregation (DLA), also targeting the challenge of extracting meaningful 
semantic features while preserving spatial information. PSPNet combines 
local and global context by a pyramid pooling module, which aggregates the 
global context at different scales and appends it to the original feature maps. 
OcNet adapts the idea of the pyramid pooling module and multiscale 'atrous' 
convolutions by introducing an object context module, which exploits object 
context at different scales instead of spatial context. 
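The effect of the rate of an 'atrous' convolution can be illustrated in one dimension. The following sketch is a toy example and not code from any of the cited networks: it omits padding, so the output is slightly shorter than the input, whereas segmentation networks typically pad to keep the resolution; the function name and the test signal are hypothetical.

```python
import numpy as np

def atrous_conv1d(signal, kernel, rate=1):
    """1D 'atrous' (dilated) convolution without padding, in the
    cross-correlation form used by most deep learning frameworks."""
    signal = np.asarray(signal, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    span = (len(kernel) - 1) * rate + 1      # receptive field of one output value
    n_out = len(signal) - span + 1
    out = np.empty(n_out)
    for t in range(n_out):
        taps = signal[t:t + span:rate]       # every rate-th sample inside the span
        out[t] = np.dot(taps, kernel)
    return out

signal = np.arange(10.0)
kernel = np.array([1.0, 0.0, -1.0])          # three weights in both cases

dense = atrous_conv1d(signal, kernel, rate=1)    # receptive field of 3 samples
dilated = atrous_conv1d(signal, kernel, rate=3)  # receptive field of 7 samples
```

With the same three weights, a rate of 3 widens the receptive field from 3 to 7 samples, which is how context at different scales can be gathered in parallel without larger filters.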


2.2 3D Semantic segmentation 


When addressing semantic segmentation of 3D point clouds with CNNs, the 
first thing to consider is the representation of the point clouds. In recent 
works, multiple different representations have been proposed. PointNet uses 
the raw and unstructured point clouds directly as input by applying pointwise 
1x1-convolutions and a symmetric operation for feature aggregation. Because a 
single global feature aggregation limits the ability to capture spatial relations, the 
authors proposed PointNet++ [19], which applies individual PointNets to local 
regions and aggregates the resulting local features in a hierarchical fashion. 
Another approach converts the point clouds into a voxel grid and applies a 3D-FCNN, 
followed by a Conditional Random Field (CRF) to refine the results. A bird's eye view 
(BEV) with the vertical axis as feature channel has also been used to retrieve a 2D 
representation of the point clouds. Having a 2D representation, the authors use 
the U-Net architecture [21], known from image segmentation. 

When working with point clouds generated by a lidar sensor, the range view 
is another possible representation. SqueezeSeg was one of the 
first works using the range view for a segmentation task. Its goal was the 
segmentation of road objects, with an improved version released in [25]. Another 
approach is RangeNet++ [16], which employs the DarkNet53 backbone 
for full semantic segmentation. LaserNet uses the range 
view as input for object detection, while one of its intermediate results is a 
semantic segmentation of the input. Its architecture is based on deep layer 
aggregation. Transforming the point cloud into its range view and applying 
established 2D image segmentation architectures mostly outperforms other 
forms of representation while being faster. Therefore, our work also builds 
upon the range view representation. 


2.3 Multimodal 3D semantic segmentation 


Multi-sensor fusion architectures using camera and lidar mostly focus on object 
detection [15]. Only [15] also tackles the task of 3D semantic 
segmentation, using the range view as input representation. Camera image 
feature maps, extracted by three ResNet blocks [7], and lidar feature 
maps extracted from the range view are concatenated and passed to a LaserNet, which 
serves as DLA for the semantic segmentation. In contrast to applying early fusion 
and fusing the RGB values with the range view, this approach aggregates camera 
image information first, using the original, usually much higher resolution of the 
camera image. This deep fusion allows more information to be preserved 
and exploited for the semantic segmentation of the lidar point cloud. While 
considerably improving the mean Intersection over Union over all classes (mIoU) 
on distant content (+5.19), the overall improvements are rather small (+0.25). 

We also use deep layer aggregation and the full camera image resolution for 
deep fusion of camera and lidar. In contrast to [15], which fuses the features 
before applying their DLA network (LaserNet), we apply a DLA network 
to both the lidar range view and the camera image separately, but fuse both 
networks following iterative deep aggregation [26]. As a result, our deep fusion 
approach is able to aggregate and use more information from the camera for the 
semantic segmentation of the lidar point cloud. 


3 Iterative deep fusion and aggregation 


In this section, we present our range view input representation, our fusion 
module and the network architecture, used for the fusion of the lidar and camera 
input. 


3.1 Range view 


Commonly used lidar sensors usually observe their environment by spinning 
a set of vertically stacked lasers around their vertical axis. The position of a 
laser in this stack is often referred to as channel, corresponding to an elevation 
angle. The Velodyne HDL-64E, used to record the SemanticKitti dataset [1][6], 
has 64 channels, an azimuth resolution of approximately 0.17° and an elevation 
resolution of $\frac{1}{3}$° for the upper and $\frac{1}{2}$° for the lower half of the lasers. The 
sensor provides measurements $o_i = (c_i, \varphi_i, r_i, e_i)$, with channel id $c_i$, azimuth 
angle $\varphi_i$, measured distance $r_i$ and reflectance $e_i$. The corresponding 3D points 
are 


$$p_i = \begin{pmatrix} x_i \\ y_i \\ z_i \end{pmatrix} = \begin{pmatrix} r_i \cos(\theta_i) \cos(\varphi_i) \\ r_i \cos(\theta_i) \sin(\varphi_i) \\ r_i \sin(\theta_i) \end{pmatrix}, \qquad (3.1)$$


omitting correction factors. The elevation angle $\theta_i$ is derived from the sensor 
configuration and the channel id $c_i$. 

We generate a range view by mapping every point or measurement to a row and
column index. Having measurements from a Velodyne HDL-64E, the row and
column indices are calculated by using the channel as row index and discretizing
the azimuth angle. If only the 3D points $p_i$ are provided, the azimuth and
elevation angle are given by

\[
\varphi_i = -\operatorname{arctan2}(y_i, x_i) \quad \text{and} \quad
\theta_i = \arcsin\left(\frac{z_i}{r_i}\right). \tag{3.2}
\]

Finally, for a range view resolution of $h \times w$, the image coordinates
$\mathbf{u}_i^{li} = (u_i^{li}, v_i^{li})$ are

\[
u_i^{li} = \begin{cases}
  \left\lfloor 0.5 \cdot h \cdot \frac{\theta_{up} - \theta_i}{\theta_{up} - \theta_{mid}} \right\rfloor
    & \theta_i \ge \theta_{mid} \\
  \left\lfloor 0.5 \cdot h \cdot \left(1 + \frac{\theta_{mid} - \theta_i}{\theta_{mid} - \theta_{down}}\right) \right\rfloor
    & \theta_i < \theta_{mid}
\end{cases} \tag{3.3}
\]

\[
v_i^{li} = \left\lfloor 0.5 \cdot \left(1 + \frac{\varphi_i}{\pi}\right) \cdot w \right\rfloor, \tag{3.4}
\]

with a vertical field of view $\theta_{fov} = \theta_{up} - \theta_{down} = 2° - (-24.8°) = 26.8°$
and the border angle between the two vertical resolutions $\theta_{mid} = -26/3°$.
Following this, we’re mapping the input measurements r, e, x, y and z to the 2D
range view, receiving a $5 \times h \times w$ input tensor R. The depth channel (r)
is visualized in Fig. 3.1.

Figure 3.1: Range view showing the lidar depth measurements.
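The mapping of Eqs. (3.2) to (3.4) can be sketched in a few lines of Python. This is only a minimal sketch, not the implementation used for the experiments: the resolution of 64 x 2048, the border angle of -26/3°, the exact form of the piecewise row formula and the clamping at the image border are assumptions made for illustration.

```python
import math

# Angles in degrees. THETA_UP and THETA_DOWN are quoted in the text;
# THETA_MID and the resolution H x W are assumptions of this sketch.
THETA_UP = 2.0
THETA_DOWN = -24.8
THETA_MID = -26.0 / 3.0
H, W = 64, 2048


def range_view_index(x, y, z, h=H, w=W):
    """Map one 3D lidar point to a (row, column) pixel of the range view."""
    r = math.sqrt(x * x + y * y + z * z)
    phi = -math.atan2(y, x)                   # azimuth, Eq. (3.2)
    theta = math.degrees(math.asin(z / r))    # elevation, Eq. (3.2)

    if theta >= THETA_MID:                    # upper half of the lasers
        row = 0.5 * h * (THETA_UP - theta) / (THETA_UP - THETA_MID)
    else:                                     # lower half of the lasers
        row = 0.5 * h * (1.0 + (THETA_MID - theta) / (THETA_MID - THETA_DOWN))
    col = 0.5 * (1.0 + phi / math.pi) * w     # Eq. (3.4)

    # Clamp so that points on the border of the field of view stay inside.
    row = min(h - 1, max(0, math.floor(row)))
    col = min(w - 1, max(0, math.floor(col)))
    return row, col


def point_at(elevation_deg, distance=10.0):
    """Helper: a point straight ahead of the sensor at a given elevation."""
    elevation = math.radians(elevation_deg)
    return distance * math.cos(elevation), 0.0, distance * math.sin(elevation)


upper_pixel = range_view_index(*point_at(-1.5))
lower_pixel = range_view_index(*point_at(-15.0))
```

Points straight ahead of the sensor land in the center column, and the two elevation ranges are each spread over one half of the rows.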

Ego motion, uncertainty and non-uniformity of the angles can lead to mapping
collisions, in which more than one point is mapped to the same range view
pixel. This implies not only a loss of information but also missing predictions for
the shadowed points. The latter isn’t an issue for object detection; for semantic
segmentation, however, it has to be considered. Therefore, a post-processing
step based on the labeled points is required to compute class labels for the
shadowed points. Following the simplest approach, we assign the same label to all
measurements projected on the same range view pixel. Another approach is
based on k-nearest neighbor [16]. We will investigate the post-processing step
in future work. In this work, we’re focusing on the feature fusion.
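The simple label assignment for shadowed points might look as follows. The function and variable names are hypothetical, and the sketch assumes that the pixel index of every point was stored during the projection.

```python
def labels_for_points(pixel_of_point, label_image):
    """Give every point the label predicted for the pixel it was projected to.

    pixel_of_point: list of (row, col) tuples, one per lidar point.
    label_image:    2D list with the predicted class id of each pixel.
    """
    return [label_image[row][col] for row, col in pixel_of_point]


# Two points collide in pixel (0, 1); the shadowed one inherits the same label.
pixels = [(0, 0), (0, 1), (0, 1)]
predicted = [[3, 7],
             [1, 1]]
point_labels = labels_for_points(pixels, predicted)
```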


3.2 Feature transformation and fusion 


A crucial part of our work is the feature fusion, which fuses the lidar and 
camera features. We’re choosing the range view as our reference system and 
project camera features into it. The inverse projection, from lidar to camera, is 
mathematically given by the equation

\[
\begin{pmatrix} u_i^{cam} \\ v_i^{cam} \\ 1 \end{pmatrix}
  = K \cdot T_{li2cam} \cdot \begin{pmatrix} p_i \\ 1 \end{pmatrix}, \tag{3.5}
\]

with the camera matrix $K$ and the transformation matrix from lidar to camera
$T_{li2cam}$. The calculated pixel indices define the correspondence between 3D
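points and camera pixels. For this correspondence to remain valid after scaling
the range view by $\beta$ or the camera image by $\alpha$, the following
extensions are made:

\[
\mathbf{u}_i^{cam}(\alpha) = \left\lfloor \alpha \cdot \mathbf{u}_i^{cam} \right\rfloor
\quad \text{and} \quad
\mathbf{u}_i^{li}(\beta) = \left\lfloor \beta \cdot \mathbf{u}_i^{li} \right\rfloor
\quad \text{with } \alpha, \beta \in [0, 1]. \tag{3.6}
\]

Given scalable projection indices, we’re now able to project camera features
$I^{\alpha}$ into the range view $R^{\beta}$, following

\[
R^{\beta}\left[\mathbf{u}_i^{li}(\beta)\right] = I^{\alpha}\left[\mathbf{u}_i^{cam}(\alpha)\right]. \tag{3.7}
\]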


This is a fixed, geometrically motivated mapping, considering only one location
per 3D point in the camera feature maps. To capture more context and to
compensate for errors in the calibration, we apply a learnable function $F_w$
before performing the fixed projection, resulting in

\[
\tilde{I}^{\alpha} = F_w\left(I^{\alpha}\right) \quad \text{and} \quad
R^{\beta}\left[\mathbf{u}_i^{li}(\beta)\right] = \tilde{I}^{\alpha}\left[\mathbf{u}_i^{cam}(\alpha)\right]. \tag{3.8}
\]

The fusion module shown in Fig. 3.2 builds upon this to implement the camera
feature transformation. We’re using a 3x3 convolution followed by Batch Norm
[9] and ReLU as learnable function $F_w$. The projected camera features and the
lidar features are concatenated and fused by ResNet blocks.

Figure 3.2: The main building block of our architecture. The fusion module transforms the camera
features into the lidar range view. Afterwards, lidar feature maps, camera feature maps and optionally
fused feature maps from the stage before are fused.
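The fixed, geometric part of the feature transformation, Eqs. (3.5) to (3.7), can be illustrated with NumPy. This is a hedged sketch rather than the actual fusion module: the function names are hypothetical, the division by the depth is an assumption that the equation in the text does not show, camera pixels are taken as (x = column, y = row), and the learnable function $F_w$ is left out.

```python
import numpy as np


def project_to_camera(points, K, T_li2cam):
    """Project (N, 3) lidar points into the camera image, Eq. (3.5).

    Returns (N, 2) pixel coordinates (x, y) and a mask of the points that
    lie in front of the camera. Dividing by the depth is an assumption.
    """
    n = points.shape[0]
    homogeneous = np.hstack([points, np.ones((n, 1))])   # (N, 4)
    cam = (T_li2cam @ homogeneous.T)[:3]                 # (3, N), camera frame
    pix = K @ cam                                        # (3, N)
    in_front = pix[2] > 0
    xy = np.full((n, 2), -1.0)
    xy[in_front] = (pix[:2, in_front] / pix[2, in_front]).T
    return xy, in_front


def gather_camera_features(cam_feat, rv_shape, rv_idx, cam_xy, valid,
                           alpha=1.0, beta=1.0):
    """Copy camera features into the range view, Eqs. (3.6) and (3.7).

    cam_feat: (C, Hc, Wc) camera feature map at scale alpha.
    rv_shape: (h, w) of the range view at scale beta.
    rv_idx:   (N, 2) full-resolution range view indices (row, col).
    cam_xy:   (N, 2) full-resolution camera pixels (x, y).
    """
    c, hc, wc = cam_feat.shape
    h, w = rv_shape
    rv = np.floor(beta * np.asarray(rv_idx)).astype(int)
    cx = np.floor(alpha * cam_xy[:, 0]).astype(int)
    cy = np.floor(alpha * cam_xy[:, 1]).astype(int)
    inside = (valid & (cx >= 0) & (cx < wc) & (cy >= 0) & (cy < hc)
              & (rv[:, 0] >= 0) & (rv[:, 0] < h)
              & (rv[:, 1] >= 0) & (rv[:, 1] < w))
    out = np.zeros((c, h, w), dtype=cam_feat.dtype)
    out[:, rv[inside, 0], rv[inside, 1]] = cam_feat[:, cy[inside], cx[inside]]
    return out


# Toy setup: lidar and camera frames coincide and z points forward.
K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 40.0], [0.0, 0.0, 1.0]])
T = np.eye(4)
points = np.array([[0.0, 0.0, 10.0], [1.0, 0.0, 10.0], [0.0, 0.0, -5.0]])
cam_xy, valid = project_to_camera(points, K, T)
features = np.arange(80 * 100, dtype=float).reshape(1, 80, 100)
rv_idx = np.array([[3, 5], [3, 6], [3, 7]])
projected = gather_camera_features(features, (8, 16), rv_idx, cam_xy, valid)
```

Range view pixels without a valid camera correspondence, such as the third point behind the camera, simply stay zero in this sketch.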


3.3 Network architecture 


Our proposed network architecture is shown in Fig. 3.3 and has three main
components. The first is a DLA network called Lidar-Net (I) for processing the
lidar range view and calculating lidar features. It follows the proposed
architecture of [14], which itself is based on [26]. By using a DLA architecture,
we ensure that multi-scale lidar features are aggregated efficiently. The second
component is another DLA network (II) with the same architecture for processing
the camera image. Additionally, we downsample the camera image before applying
the DLA network. The resolution of the camera image is much higher than that of
the lidar image, so the induced loss in spatial information is small, whereas
the aggregated semantic information is considerably improved. We follow
the ResNet architecture and downsample the camera image with a strided
convolution and max pooling by a factor of four. This also decreases the run
time and memory requirements. The last components are the fusion blocks (III),
which apply the previously presented feature transformation and fusion. They
follow the idea of a feature aggregator except that they transform and aggregate
features of different sensors instead of different scales of one sensor.

Figure 3.3: Our proposed fusion architecture, which fuses the lidar and camera features iteratively,
following the idea of iterative deep aggregation [26]. The labels indicate the output stride of the
individual blocks. We use the same network parameters for (I) and (II) as [14].


4 Experiments 


4.1 SemanticKitti 


We’re evaluating our approach on the SemanticKitti dataset [6], which
contains labels for 19 classes for the single scan benchmark. A total of 22
labeled sequences results in 43552 labeled scans. The official split allocates
sequences 0-10 for training and sequences 11-21 for testing, for which the labels
haven’t been published. However, the official benchmark doesn’t support the
usage of the camera images, meaning that for our evaluation, only the sequences
0-10 with published labels can be used. Therefore, we’re excluding sequences 02,
06 and 10 from training and validation and use them only in the end for testing.
This results in 6963 frames for testing and 16238 for training and validation. We
follow the official evaluation metric and report the mean Intersection-over-Union
(mIoU). For our approach, only the lidar scan parts overlapping with the camera’s
field of view in the front of the car can be used.


4.2 Implementation details 


Our training starts with an initial learning rate of $10^{-4}$, which is then
multiplied in each training iteration $it$ by $10^{-it/it_{max}}$. Thereby, the
learning rate decreases exponentially by a factor of $\frac{1}{10}$ during
training. We train our network for 50k iterations with a batch size of 40. To
improve generalizability and reduce overfitting, we’re using random crops of the
whole 360° lidar scan for training the lidar net. Although the crop is random,
it follows the constraint that the overlapping field of view with the camera has
to be fully inside the crop of size 64 x 1536. The fusion modules finally crop
the resulting lidar feature maps exactly to the overlapping field of view.
Additionally, we apply random horizontal flipping to the lidar and camera
images.
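The learning rate schedule and the constrained random crop can be sketched as follows. This is an illustrative sketch under assumptions: the decay is read as $lr_0 \cdot 10^{-it/it_{max}}$, the scan width of 2048 columns and the column range of the camera field of view are hypothetical values, and all function names are made up for this example.

```python
import random

LR_INITIAL = 1e-4
IT_MAX = 50_000


def learning_rate(it, lr0=LR_INITIAL, it_max=IT_MAX):
    """Exponential decay that reaches lr0 / 10 at the last iteration."""
    return lr0 * 10.0 ** (-it / it_max)


def sample_crop_start(fov_start, fov_end, crop_width=1536, scan_width=2048,
                      rng=random):
    """Pick a random crop start column such that the camera field of view
    [fov_start, fov_end) lies completely inside the crop."""
    lowest = max(0, fov_end - crop_width)
    highest = min(fov_start, scan_width - crop_width)
    return rng.randint(lowest, highest)   # both bounds are inclusive


# Hypothetical camera field of view covering columns 768 to 1279.
generator = random.Random(0)
starts = [sample_crop_start(768, 1280, rng=generator) for _ in range(100)]
```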

To counteract the class imbalance, we’re using a class-balanced cross entropy 
loss for the final output as well as the auxiliary loss. The latter is used on the 
final feature map of the Lidar-Net. Following the proposed settings of PSPNet 
[29], we’re weighting the auxiliary loss by 0.4.
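A class-balanced cross entropy with an auxiliary term can be sketched with NumPy as shown below. The text does not state how the class weights are computed, so the inverse-frequency weighting is an assumption of this sketch, as are all function names; only the auxiliary weight of 0.4 is taken from the text.

```python
import numpy as np


def class_weights(labels, num_classes):
    """Inverse-frequency class weights (assumed scheme), normalized to mean 1."""
    counts = np.bincount(labels.ravel(), minlength=num_classes).astype(float)
    weights = 1.0 / np.maximum(counts, 1.0)
    return weights * num_classes / weights.sum()


def weighted_cross_entropy(logits, labels, weights):
    """logits: (N, C) scores, labels: (N,) class ids, weights: (C,)."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    picked = log_probs[np.arange(labels.size), labels]
    w = weights[labels]
    return float(-(w * picked).sum() / w.sum())


def total_loss(main_logits, aux_logits, labels, weights, aux_weight=0.4):
    """Loss of the final output plus the weighted auxiliary loss."""
    main = weighted_cross_entropy(main_logits, labels, weights)
    aux = weighted_cross_entropy(aux_logits, labels, weights)
    return main + aux_weight * aux


labels = np.array([0, 0, 0, 1])
weights = class_weights(labels, num_classes=2)
uniform_logits = np.zeros((4, 2))
loss = total_loss(uniform_logits, uniform_logits, labels, weights)
```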


4.3 Results 


We evaluate our approach and present the improvements gained by the fusion of 
lidar and camera image features. Therefore, we compare the results of our deep 
fusion architecture, called Fusion-Net, to Lidar-Net, which uses only the lidar 
scans. The results of both approaches are shown in Tab. 4.1. Overall, our fusion
approach outperforms Lidar-Net by a considerable margin, and also the majority 
of the individual classes considerably benefit from the deep fusion approach.

[Class labels of the columns are not recoverable, except "vegetation" and "bicyclist".]

Approach     IoU per class (first ten classes)
Lidar-Net    93.1  76.8  56.1   3.4  67.1  81.7  42.0  23.2  39.8  29.0
Fusion-Net    ...  77.0  55.9   0.4  74.0  82.0  37.8  26.4  43.1  29.1

Approach     IoU per class (remaining nine classes)                 mIoU
Lidar-Net    78.0  58.1  67.2  35.6  11.8   2.4  57.2  36.4  39.9   47.3
Fusion-Net   81.4  65.8  72.0  42.7  11.0   0.3  59.4  49.6  45.6    ...

Table 4.1: Comparison of the results of our deep fusion architecture and the purely lidar based
Lidar-Net.
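As a quick consistency check of the reported metric, the mIoU is the plain average of the per-class IoU values, with the usual definition IoU = TP / (TP + FP + FN). The short sketch below recomputes it for the 19 Lidar-Net values listed in Table 4.1; the function names are illustrative.

```python
def iou(tp, fp, fn):
    """Intersection over Union of one class from point counts."""
    return tp / (tp + fp + fn)


def mean_iou(per_class_iou):
    """Unweighted mean over all classes."""
    return sum(per_class_iou) / len(per_class_iou)


# Per-class IoU values of Lidar-Net as listed in Table 4.1 (in percent).
lidar_net = [93.1, 76.8, 56.1, 3.4, 67.1, 81.7, 42.0, 23.2, 39.8, 29.0,
             78.0, 58.1, 67.2, 35.6, 11.8, 2.4, 57.2, 36.4, 39.9]
lidar_net_miou = mean_iou(lidar_net)
```

Rounded to one decimal, the result agrees with the mIoU of 47.3 reported for Lidar-Net.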


5 Conclusion and Outlook 


In this work, we’ve presented a deep learning approach for semantic segmentation 
of 3D lidar point clouds. Our approach uses a range view representation of 
the lidar scans, enabling the application of established image segmentation 
approaches. Furthermore, we use camera image feature maps of different 
scales and iteratively fuse them inside our network with the lidar feature maps. 
Our experiments underline the advantages of our deep fusion approach, which 
outperforms a lidar-only approach by a considerable margin in terms of the 
mIoU. Also, most of the individual classes considerably benefit from the fusion.
For the future, we plan to further improve our fusion modules and thereby
increase the benefits of our fusion architecture. We’re also planning a more
in-depth analysis of the benefits of fusing camera and lidar data for 3D semantic
segmentation.


Abstract


This report presents work and results on Activity Recognition using Part Affinity
Fields for real-time surveillance applications. Starting with a short introduction
to the motivation, this report gives a detailed overview of the key idea of
the pursued approach. In addition, a variety of experiments is presented,
covering i) the impact of the number of input frames, ii) the impact of
different simple dimensionality reduction approaches, and iii) a comparison of
how the multi-class and the binary problem formulation influence the
performance.


1 Introduction 


Anomaly detection, amongst other strongly related topics like outlier and novelty
detection, plays an important role in various research fields such as network
traffic monitoring, time series analysis, medical image analysis, and video
surveillance. However, when talking about anomalies in the context of video
surveillance, the understanding of what an anomaly actually is can differ
strongly between applications. For instance, an anomaly can be an abandoned
suitcase at a public place, a vehicle driving through a pedestrian zone, or
people behaving suspiciously or conspicuously. Recent progress in the fields of
classification, pattern recognition and time series prediction has also brought
the field of activity recognition into the focus of application-oriented
research for surveillance scenarios. This report proposes a new
human-pose-based approach to classifying the behavior of pedestrians. It
presents details of the architecture and various experiments on different time
horizons as well as various considerations that were made in order to tackle
the problem of activity recognition in such scenarios. The presented approach
is preparatory work for future activities on human-centered abnormal behavior
detection.


2 Part affinity fields for activity recognition 


2.1 Human pose estimation in the wild 


Human Pose Estimation describes the problem of estimating a skeletal
representation of a person based on information gathered using certain types of
sensors. The skeletal representation is typically modeled as a graph
$G = (V, E)$ where $V \subset \mathbb{R}^n$ is a set of keypoints and
$E \subset V \times V$ is a set of edges connecting various keypoints. Depending
on the chosen skeletal model, the graph can be seen as a tree. Usually the used
sensors are classical video cameras or depth cameras delivering RGB or RGB-D
information, respectively. This work focuses on the 2D case using classical
cameras and RGB data. This decision is driven by the corresponding problem
domain, namely video surveillance in urban setups, where typical camera setups
consist of RGB cameras. To this point, RGB-D cameras are rarely used. As a
consequence, the resulting skeletons produced by human pose estimation
algorithms consist of keypoints in a two-dimensional space with
$V \subset \mathbb{R}^2$.


2.2 Part affinity fields 


For this approach, the framework OpenPose by Cao et al. [1], which belongs to
the group of bottom-up methods, is used. Contrary to top-down methods,
bottom-up methods first locate all keypoints in a given image, which are
connected in a subsequent step. To do so, the method proposed in [1] computes
Part Affinity Fields (PAF) that are used to connect estimated keypoints by
adding further semantic information about visible body parts. In detail, the
computed PAFs are used for constructing a bipartite keypoint graph that is
subject to the final optimization problem, which is solved using the Hungarian
Method [7].

[Figure panels: merged body parts; left leg, left arm, torso, right arm, right leg]

Figure 2.1: Based on the Part Affinity Fields provided by the OpenPose framework, body part
maps encoding the presence of a body part are constructed. This is done per frame. The resulting
body parts B for the highlighted person are shown in the lower part of the figure. For visualization
purposes, the body parts are merged into a single-layer image to show that they form an understandable
representation of pedestrians.


2.3 Architecture 


Since the aim of activity recognition in surveillance scenarios is near
real-time processing of video footage, typical activity recognition frameworks
are not applicable due to their large network architectures and the resulting
demanding hardware requirements. As a result, the focus of this work lies on
developing an approach using a much smaller neural network. For the task of
image classification, Howard et al. [3] proposed a network architecture called
MobileNet that is much smaller, i.e. has far fewer parameters, than most
competing architectures while achieving comparable performance. Due to the
promising performance of MobileNet for the classification of single images, the
decision was made to adapt it for a custom and fast activity recognition
approach. This choice brings a constraint into the considerations: working on
raw 2D keypoint coordinates is not possible with the pre-defined architecture.
Two possibilities come up when dealing with this problem. The first is to
transform the keypoint coordinates into images, the second to use the body
representation already provided by the OpenPose framework. Since the latter is
available out of the box when using OpenPose, the decision was made to adapt
the Part Affinity Fields instead of the raw coordinates.


2.3.1 From part affinity fields to human body parts 


In order to reduce the number of inputs, semantically corresponding Part Affinity
Fields $F_{part} = (F_{part,x}, F_{part,y})$ are aggregated to five body parts:
torso, left arm, right arm, left leg and right leg. The following equation
shows the formula for computing the corresponding body part using the Part
Affinity Fields related to the left leg.

\[
B_{leftLeg} = \sqrt{\lambda \cdot \left(F_{leftCalf,x}^2 + F_{leftCalf,y}^2\right)}
            + \sqrt{\lambda \cdot \left(F_{leftThigh,x}^2 + F_{leftThigh,y}^2\right)} \tag{2.1}
\]

where $\lambda \in \mathbb{R}^+$ is a scaling factor. Note that B encodes the
presence of a body part rather than its direction since this information is
lost by transforming the Part Affinity Fields into body parts. However, since
the information about single body parts is still available and no further
reducing operations are performed, it is still possible to infer the
orientation of the represented person.
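The aggregation of Part Affinity Fields into one body part map can be sketched with NumPy. This is a hedged sketch: it assumes that Eq. (2.1) sums the scaled magnitudes of the limb fields belonging to a body part, and the function names and toy arrays are hypothetical.

```python
import numpy as np


def limb_presence(paf_x, paf_y, lam=1.0):
    """Scaled magnitude of one Part Affinity Field; the direction is discarded."""
    return np.sqrt(lam * (paf_x ** 2 + paf_y ** 2))


def body_part_map(limb_pafs, lam=1.0):
    """Sum the presence maps of all limbs that belong to one body part.

    limb_pafs: list of (paf_x, paf_y) arrays of equal shape (H, W).
    """
    return sum(limb_presence(px, py, lam) for px, py in limb_pafs)


# Toy fields: a calf with direction (3, 4) and a thigh with direction (0, 1).
calf = (np.full((2, 2), 3.0), np.full((2, 2), 4.0))
thigh = (np.zeros((2, 2)), np.ones((2, 2)))
left_leg = body_part_map([calf, thigh])
```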


2.4 Training dataset 


The decision to use Part Affinity Fields as input to the model architecture makes
it impossible to use a pre-trained network since the input volume has five instead
of three channels. Furthermore, the input channels do not correspond to typical
structures as encountered in RGB images. Therefore, the chosen
architecture has to be trained from scratch.

In order to train the network, three existing datasets were merged: INRIA Xmas
Motion Acquisition Sequences (IXMAS) [11], UT-Interaction and IOSB
Multispectral Action Dataset [2]. All these datasets come with different sets of
activities $A_{IXMAS}$, $A_{UT-Interaction}$ and $A_{IOSB}$. For this work, the
available activities were merged to get a total of 19 activities from all three
datasets. About 80% of the available activities come from IXMAS, which is the
most diverse dataset of these three. Another reason to choose these datasets is
the viewing perspective and the size of the urban outdoor environment, which
showed the best fit to the field of application.

Since all datasets consist of video sequences, it is possible to make use of
temporal information, which would typically be done by tracking pedestrians. For
the initial setup, no tracking is considered for performance reasons, as tracking
of multiple targets would introduce further expensive computations. However,
an alternative way to benefit from the available temporal information is inspired
by the anchor cuboids used in A. A schematic overview of this principle is
given in Figure 2.2. Given a bounding box $B_t$ at timestep $t$ and a window
size $k$, an input volume is constructed by simply aggregating the spatial
information at the location of $B_t$ over the last $k$ timesteps

\[
B_{t,k} = B_{t-k+1} \oplus \dots \oplus B_{t-2} \oplus B_{t-1} \oplus B_t \tag{2.2}
\]

where $\oplus$ describes the concatenation operation along the channel axis. Note
that each bounding box $B_t$ contains spatially corresponding information from
all five body part channels and hence can be written as

\[
B_t = (B_{leftLeg}, B_{leftArm}, B_{torso}, B_{rightArm}, B_{rightLeg}) \tag{2.3}
\]
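The construction of the temporal cuboid in Eq. (2.2) amounts to cropping the same box from the last k frames and concatenating along the channel axis. The following NumPy sketch illustrates this; the array layout (H, W, 5), the box format and the function name are assumptions, and the later resizing to the network input size is omitted.

```python
import numpy as np


def temporal_cuboid(body_part_frames, box, k):
    """Stack the body part maps inside one fixed box over the last k frames.

    body_part_frames: list of (H, W, 5) arrays, oldest first, newest last.
    box:              (top, left, bottom, right) of the detection at time t.
    Returns an array of shape (box_h, box_w, 5 * k), ordered from the
    oldest frame t - k + 1 to the current frame t.
    """
    top, left, bottom, right = box
    crops = [frame[top:bottom, left:right, :] for frame in body_part_frames[-k:]]
    return np.concatenate(crops, axis=-1)


# Twelve toy frames whose pixel values equal the frame index.
frames = [np.full((48, 64, 5), float(i)) for i in range(12)]
volume = temporal_cuboid(frames, box=(8, 16, 40, 32), k=10)
```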


2.4.1 Multi-class approach 


As mentioned in the introductory part of this section, the used dataset consists of
19 activity classes with sequences taken from three different public datasets. The
set of available activities ranges from everyday activities like walking and sitting
down to more unusual ones like kicking and punching. In the multi-class
approach, every movement of a pedestrian is classified into one of the available
classes. For the training of the network, the widely used Adam optimizer
was used with an initial learning rate of $10^{-3}$. As the training objective,
cross-entropy was chosen as it is the most common loss for classification tasks.
The learning rate is reduced by a factor of 0.1 after every 20 epochs without
improvement on the used validation set. The whole training was conducted
on an Nvidia DGX-1 using a single Tesla V100 card with 32 GiB of memory.
This allows the use of large batches with around 500 samples per batch. The actual
number of samples contained in a single batch is chosen empirically and depends
on the regarded number of timesteps $k$. Figure 2.3 illustrates the overall setup of
the final architecture, which takes as input a set of $k$ subsequent body part sets
of a given person detection $B_{t,k}$. The input is then processed by the adapted
MobileNet model and classified as one of the 19 regarded activity classes.

Figure 2.2: In order to avoid further computation overhead by performing tracking of pedestrians,
temporal cuboids (bounding box cuboids) are built as shown in red. For a given timestamp $t$, a given
window size $k$ and a bounding box $B_t$, we merge the content of the corresponding bounding boxes
$B_{t-k+1}, \dots, B_t$. All bounding boxes share the same location at different points in time.
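The learning rate rule described above (reduce by a factor of 0.1 after 20 epochs without improvement) can be written down as a small, framework-independent class. This is only a sketch: the class name is hypothetical, and using the validation loss as the monitored quantity is an assumption, since the text does not name the metric.

```python
class PlateauSchedule:
    """Multiply the learning rate by `factor` after `patience` epochs
    without improvement of the monitored validation loss."""

    def __init__(self, lr=1e-3, factor=0.1, patience=20):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.stale_epochs = 0

    def step(self, val_loss):
        """Register the validation loss of one epoch and return the new rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                self.lr *= self.factor
                self.stale_epochs = 0
        return self.lr


schedule = PlateauSchedule()
schedule.step(1.0)                                   # first epoch sets the best value
history = [schedule.step(1.0) for _ in range(20)]    # 20 epochs without improvement
```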


2.4.2 Binary approach 


In addition to the multi-class approach, a binary classification task with the aim
to distinguish between target activities and non-target activities was investigated.
As target activities, a subset of activities, namely kick, punch, hit and push, was
chosen, since they are quite similar and relevant activities. To encode these
two classes, the number of output neurons was reduced to two in the
model architecture.

Figure 2.3: Input to the model is a volume of size $224 \times 224 \times 5k$ where $k$ is the number of
timesteps that are taken into account. Body parts are stacked along the third axis over time. Each
detected pedestrian and its corresponding body part maps are resized to $224 \times 224$ as expected by
the MobileNet [3] architecture. Every row on the left corresponds to one of the five body parts $B_i$,
each column to a considered timestep. The number of neurons in the final classification layer is
furthermore changed to the number of activity classes $c \in \{2, 19\}$.
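The relabeling for the binary task is a simple mapping from the 19 activity names to two classes. A minimal sketch, assuming the activities are available as strings (the non-target names in the example are placeholders):

```python
TARGET_ACTIVITIES = {"kick", "punch", "hit", "push"}


def to_binary_label(activity):
    """Return 1 for a target activity and 0 for every other activity."""
    return 1 if activity in TARGET_ACTIVITIES else 0


samples = ["walk", "kick", "sit down", "punch", "wave"]
binary_labels = [to_binary_label(activity) for activity in samples]
```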

A consequence of transforming the multi-class into a binary classification problem
is the accentuation of the imbalance between the regarded classes. To address
this problem, the imbalance was considered implicitly by changing the used
training loss. For this reason the Focal Loss was adopted, which is an
extension of the classical cross-entropy loss that introduces a weighting of
samples based on the quality of their already achieved classification result.

\[
L_{focal}(p_t) = -(1 - p_t)^{\gamma} \cdot \log(p_t) \quad \text{with } \gamma \in \mathbb{R}^+ \tag{2.4}
\]

As can be seen in the equation above, the difference to the cross-entropy loss
comes from an additional factor $(1 - p_t)^{\gamma}$ that reduces the loss for
well-classified samples ($p_t > 0.5$). The introduced variable
$\gamma \in \mathbb{R}^+$ controls how strongly the loss of the well-classified
samples is reduced. The higher the value of $\gamma$, the less the samples on
which the regarded model already achieves good results affect the computed
gradients and hence the training process.

Figure 3.1: In order to evaluate the impact of the presented considerations on the overall performance
of the approach, a dataset that is only used for evaluation was created using two cameras mounted at
different heights and pointing to the same location.
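The effect of the modulating factor in Eq. (2.4) is easy to verify numerically. The sketch below compares the focal loss with the plain cross-entropy for one well-classified and one poorly classified sample; the value $\gamma = 2$ is an assumed example value, not one reported in this report.

```python
import math


def focal_loss(p_t, gamma=2.0):
    """Focal loss for the probability p_t assigned to the true class, Eq. (2.4)."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def cross_entropy(p_t):
    """Plain cross-entropy for the probability of the true class."""
    return -math.log(p_t)


easy, hard = 0.9, 0.1
ratio_easy = focal_loss(easy) / cross_entropy(easy)   # equals (1 - 0.9)^2
ratio_hard = focal_loss(hard) / cross_entropy(hard)   # equals (1 - 0.1)^2
```

With these values the loss of the well-classified sample is scaled down far more strongly than that of the poorly classified one, and for $\gamma = 0$ the focal loss reduces to the cross-entropy.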


3 Evaluation 


3.1 IOSB-Ka dataset 


For the evaluation, a suitable dataset was created that shares many properties
with typical public surveillance scenarios. The dataset was recorded at the
Fraunhofer Institute for Optronics, System Technologies and Image Exploitation
IOSB in Karlsruhe using a common video surveillance setup. It consists of six
sequences with an average duration of 56 seconds from two different cameras,
both showing the same location. The cameras were mounted with different
orientations and at different heights. Each video sequence shows a group of
people performing actions from a predefined action set that comprises actions
like kicking, punching and waving. Figure 3.1 shows two randomly selected
frames, each taken from one of the two cameras.


3.2 Temporal window size 


The first part of our experiments was conducted to examine the influence of
the temporal window on the overall performance. Since all sequences from our
training dataset were recorded with frame rates between 25 and 30 frames per
second, the number of consecutive regarded frames has to be chosen large enough
to capture the important part of an action. Therefore, a series of experiments was
performed for values of $k \in \{1, 6, 10, 14, 18, 24\}$. As stated earlier, the input
to the model is a sequence of $k$ consecutive 5-tuples and can be seen as five
sequences showing the temporal behavior of different body parts. Table 3.1
shows the results for different window sizes. It is obvious that all approaches
perform almost identically for both activities kick and punch. Even for a time
window of almost a second ($k = 24$), the results do not improve. However, the
results for wave are better and the benefit of including temporal information is
clearly visible. The reason for the results on the first two activities might be
very similar motions in the training dataset and the far wider variety of
forms for the same activity in the test set. Another explanation could be that
the model could not learn to distinguish between similar activities. This has to
be investigated in the future with additional experiments.

Table 3.1: The average precision on the evaluation dataset for the provided activities and the
overall mean average precision vary slightly for different values of $k$. The corresponding values are
indicated by the identifier PBAR-MFk. The model without temporal information, i.e. $k = 1$, is
referred to as PBAR-SF. The best results were achieved with a window size of $k = 10$.

            kick   punch  wave   mAP
PBAR-SF     0.32   0.37   0.42   0.37
PBAR-MF6    0.34   0.37   0.51   0.40
PBAR-MF10   0.35   0.42   0.66   0.48
PBAR-MF14   0.2    0.39   0.52   0.41
PBAR-MF18   0.33   0.40   0.64   0.46
PBAR-MF24   0.30   0.45   0.59   0.45


3.3 Dimensionality reduction 


A major drawback that comes up when increasing the number of timesteps
and hence the size of the input volume to the neural network is the rising
computational complexity at training as well as at inference time. While the former
is not too much of a problem, the latter has a direct effect on the frame rate and
hence on the ability to be close to real-time. As a consequence, the question arises
whether a reduction of dimensionality achieves comparable performance
to the full temporal information. Hence, the network was trained on merged
inputs using two ways to reduce the dimensionality: max and mean reduction.
Furthermore, the decision was made to keep the spatial information and perform
the reduction just over the time dimension, i.e. fusing just the information
corresponding to the same body part. Figure 3.2 illustrates how the dimensionality
reduction works in principle and shows the corresponding outcome for our two
considered reduction functions.

Figure 3.2: Given a timestep $t$ and a corresponding input volume $B_{t,k}$, each body part of the input
volume is reduced to a single channel. The illustration shows the result of the reduction, by way of
example, for the right leg $B^i_{rightLeg}$, where $i$ stands for the corresponding offset to the current
timestep. Given $k$ timesteps, the chosen reduction function
$F_{reduce}(\{B^i_{rightLeg} \mid i \in \{0, \dots, k - 1\}\})$ is applied, merging on each pixel along
the time axis. It is obvious that the max-reduction is much more comprehensible for the human; however,
the mean-reduction does not lose any relevant information. It keeps information about the velocity of
the action through the amplitude of the output signal.
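Both reduction functions operate per body part along the time axis only. A NumPy sketch under assumptions: the input volume is laid out as in Eq. (2.2), i.e. the five body part channels of the oldest frame come first, and the function name is hypothetical.

```python
import numpy as np


def reduce_over_time(volume, k, mode="max"):
    """Collapse a (H, W, 5 * k) input volume to (H, W, 5) along the time axis.

    Assumes the channel order of Eq. (2.2): the five body parts of the
    oldest frame first, those of the newest frame last.
    """
    h, w, channels = volume.shape
    per_frame = volume.reshape(h, w, k, channels // k)   # (H, W, time, body part)
    if mode == "max":
        return per_frame.max(axis=2)
    if mode == "mean":
        return per_frame.mean(axis=2)
    raise ValueError(f"unknown reduction mode: {mode}")


# A body part that is present (value 1) in only one of k = 4 frames.
volume = np.zeros((2, 2, 20))
volume[..., 5] = 1.0            # frame index 1, body part index 0
reduced_max = reduce_over_time(volume, k=4, mode="max")
reduced_mean = reduce_over_time(volume, k=4, mode="mean")
```

In the toy example the max reduction keeps the full response of the body part, while the mean reduction scales it by the fraction of frames in which it was present.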


Table 3.2: In order to investigate the effect of dimensionality reduction, two simple approaches were
applied to the best performing model PBAR-MF10: max and mean reduction. In both cases the
resulting reduced input volumes do not carry enough information, so that the performance of the
trained model drops significantly. The last row also provides results for the non-reduced model
as comparison.

          kick   punch  wave   mAP
max       0.29   0.29   0.57   0.38
mean      0.28   0.33   0.57   0.40
without   0.35   0.42   0.66   0.48

Applying reduction to the input decreases the training time by approximately
80%, from 29.5 minutes to 5.7 minutes per epoch. This effect cannot be observed
at inference time, where the overall processing time stays identical. A
possible reason for this might be that the training is performed using Python.
Python is not optimized for memory efficiency and hence moving data from
RAM to VRAM might be a bottleneck. The final production system is written in
C++ using the libtorch library provided by the PyTorch development team, which
seems to work more efficiently when shifting data between devices. However,
Table 3.2 indicates that in both cases the performance decreases significantly by
a similar amount when using these simple reduction mechanisms.


Since mean and max reduction appeared to be too strict according
to the results presented in Table 3.2, further investigations on the impact of
dropping intermediate frames in order to reduce the input size were performed.
The dropping is performed in an equally spaced manner using an offset
$s \in \mathbb{N}$ and hence results in timesteps
$t, t - s, t - 2s, \dots, t - (k - 1) \cdot s$. Written in a more compact way, a
sequence of $k$ timesteps with an offset $s$ consists of ordered samples
$B_{t-i}$ with $i \in I_{k,s}$ and

\[
I_{k,s} = \{\, i \in \mathbb{N}_0 \mid i \le (k - 1) \cdot s \;\wedge\; \exists m \in \mathbb{N}_0 : i = m \cdot s \,\} \tag{3.1}
\]

This method is referred to as striding. Furthermore, for given
$k, s \in \mathbb{N}$ the function $\kappa(k, s)$ describes the sampling window
size.

\[
\kappa(k, s) = (k - 1) \cdot s + 1 \quad \forall k, s \in \mathbb{N} \tag{3.2}
\]

For $s = 1$ this resembles the original set of timesteps $t, \dots, t - k + 1$ for
a given input window size $k \in \mathbb{N}$. Hence, the sampling window size
$\kappa$ equals the input window size $k$. Table 3.3 shows results for a stride
$s = 3$ on input window sizes of $k \in \{6, 10\}$ and compares them to results of
approaches with a similar sampling window size without striding. The decision to
use a stride of 3 was made empirically. The results indicate that the
performance is almost identical to the non-strided experiments and therefore it
is possible to achieve similar results at a reduced frame rate. By using a
stride $s = 3$ the effective frame rate is approximately one third of the
original and hence around 10 frames per second.

Table 3.3: Two strided approaches with an offset $s = 3$ and input window sizes of $k \in \{6, 10\}$ are
compared with non-strided approaches that share a similar sampling window size $\kappa$. The results
show that the sampling window size is more important than the number of actually considered frames,
since the performance does not decrease significantly when taking fewer frames for comparable $\kappa$.

              $\kappa$   kick   punch  wave   mAP
PBAR-MF18     18         0.33   0.40   0.64   0.46
PBAR-MF6s3    16         0.32   0.44   0.54   0.43
PBAR-MF24     24         0.30   0.45   0.59   0.45
PBAR-MF10s3   28         0.28   0.41   0.65   0.45
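The index set of Eq. (3.1) and the sampling window size of Eq. (3.2) translate directly into code. The following sketch uses hypothetical function names and reproduces the sampling window sizes of the two strided models.

```python
def stride_offsets(k, s):
    """Offsets i in I_{k,s}: multiples of s up to (k - 1) * s, Eq. (3.1)."""
    return [m * s for m in range(k)]


def sampling_window(k, s):
    """Sampling window size kappa(k, s), Eq. (3.2)."""
    return (k - 1) * s + 1


def strided_timesteps(t, k, s):
    """Frame indices t - i for all i in I_{k,s}, oldest first."""
    return [t - i for i in reversed(stride_offsets(k, s))]


kappa_mf6s3 = sampling_window(6, 3)
kappa_mf10s3 = sampling_window(10, 3)
frames_used = strided_timesteps(t=20, k=3, s=3)
```

The two window sizes correspond to the $\kappa$ values 16 and 28 listed for PBAR-MF6s3 and PBAR-MF10s3.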


3.4 Multi-class vs. binary problem formulation 


For the evaluation of the binary problem formulation, PBAR-MF10 was again
chosen as baseline. As explained in Section 2.4.2, the number of classes was
reduced. The idea behind this decision was that it might be easier for the
network to distinguish between ordinary and non-ordinary activities. Splitting
the available data in such a manner leads to more samples per class and
hopefully to a better performance. As Table 3.4 indicates, this is the case. By
tackling the problem using a binary problem formulation, the mean average
precision on the regarded test set could be increased by about 18 percentage
points, from 0.48 to 0.66.
However, a major drawback of the binary formulation is that it loses the ability
to distinguish between the kinds of activity that were perceived. This makes it
more difficult to understand the decision of the model, especially when the target
class consists of a variety of different activities.

Table 3.4: For this experiment, again the best performing architecture, PBAR-MF10, was chosen
and trained in binary and multi-class manner. On the presented test set, the binary approach performs
approximately 18 percentage points mAP better than its multi-class counterpart.

              mAP
multi-class   0.48
binary        0.66


4 Conclusion 


This report presents our work on Part Affinity Field based Activity Recognition.
It gives an overview of how to include the information of the Part Affinity
Fields provided by the OpenPose framework in a lightweight approach that
is designed to work in real-time applications. Furthermore, several aspects were
evaluated: i) the impact of the number of input frames, ii) the impact of different
simple dimensionality reduction approaches, and iii) a comparison between the
multi-class and binary problem formulations and how they influence the
performance. Future work will address further aspects in order to improve the
performance and take a closer look at the temporal aspect of the approach:
Does the use of tracking algorithms improve the performance compared to
the temporal cuboid approach? Can we benefit from the incorporation of
Spatio-Temporal Affinity Fields [9]? How does MobileNet with 3D convolutions [6]
perform? In addition, more elaborate yet fast dimensionality reduction
approaches like PCA or LDA, as well as incorporating an understanding of the
similarity of activities into the approach, will be subject to future investigations.
Abstract


The optical and digital resolution as well as the signal-to-noise ratio are important
characteristics of optical spectrometers and are available in data sheets. But how
can an optical spectrometer system be selected for a specific application?
This article is intended as an aid for characterizing optical spectrometers and
hyperspectral cameras. It introduces a benchmark calculation which indicates
the measurement uncertainty of absorption bands.


1 Introduction 


In optical spectroscopy, the wavelength-dependent intensity of light is measured.
Due to the interaction between light and matter, the direction of light
propagation can change through elastic scattering processes. Furthermore, light can
be absorbed through interaction with molecules, which changes the intensity of the
light. The wavelength-dependent probability of light scattering and absorption
depends on the material properties of the sample. Therefore, it is possible
to determine material properties of the sample by recording its reflected or
transmitted optical spectrum. Applications can be found in various fields such as
smart agriculture, the food industry and petrochemistry [9].


Due to the continuously advancing development of microsystems technology
(MEMS), miniaturized spectrometers and hyperspectral camera systems can
be manufactured cost-effectively and in large quantities. To make sensors of
different types comparable, a benchmark parameter is presented
below which links the sensor noise with the optical and digital resolution.


In the following chapter the state of the art in chemometrics is briefly explained. 
Afterwards, signal generation and detection are discussed in more detail. Finally, 
the findings are used to define spectral features and sensor characterization. 


2 State of the art in chemometrics 


For the statistical analysis of spectroscopic data, the research discipline of
chemometrics has developed within the field of chemistry. In the following,
the state of research on theoretical simulation and the established
pre-processing methods of chemometrics are summarized.


Only the core statements are given. For detailed information, relevant sources
are cited in each section.


2.1 Theoretical spectroscopy and simulation of spectroscopic 
results 


Molecular vibrations can be excited by interaction with light, which causes
an absorption of the light due to the law of energy conservation. For a better
understanding it is useful to consider light as particles, which are called photons.
The energy of a photon is given by its frequency, which can also be expressed as
a wavelength using the speed of light. As a result from quantum mechanics,
only discrete energy levels of molecular vibrations can be excited. Together,
the fundamental law of energy conservation and the discrete energy levels
of molecular vibrations lead to the simple result that only photons with a
wavelength that matches these energy levels can be absorbed.
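

The relation between photon energy and wavelength mentioned above is E = h · c / λ.
The following sketch converts a wavelength in nanometres to a photon energy in
electronvolts using the exact SI values of the constants. The function name and the
example wavelengths are chosen here for illustration only.

```python
PLANCK_H = 6.62607015e-34            # Planck constant in J s (exact SI value)
SPEED_OF_LIGHT = 299_792_458.0       # speed of light in m/s (exact SI value)
ELEMENTARY_CHARGE = 1.602176634e-19  # J per eV (exact SI value)


def photon_energy_ev(wavelength_nm):
    """Photon energy E = h * c / lambda, returned in electronvolts."""
    wavelength_m = wavelength_nm * 1e-9
    return PLANCK_H * SPEED_OF_LIGHT / wavelength_m / ELEMENTARY_CHARGE


# Limits of the near and short-wave infrared range and one wavelength in between.
for wavelength_nm in (780.0, 1350.0, 2500.0):
    print(wavelength_nm, "nm:", round(photon_energy_ev(wavelength_nm), 3), "eV")
```

Longer wavelengths correspond to lower photon energies, so an absorption band at a
longer wavelength belongs to a smaller energy difference between vibrational levels.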


However, the anharmonic potential of atomic forces leads to a non-linearity in
the energy levels of molecular vibrations. Therefore, the energy levels in a
solid-state or liquid sample change strongly with the presence of additional
atoms, temperature or pressure. For this reason, theoretical spectroscopy is still
a field of research. Simulation of spectroscopic results is only possible for
simple molecules in solutions of low concentration [1].


Furthermore, the transfer of chemometric calibration models to other products
is hardly possible. This means that a calibration for the sugar content of apples
can only be applied to apples and not to other types of fruit.


2.2 Chemometric methods for spectral preprocessing 


In the previous sections, the focus was on absorption and its relationship to 
material properties. However, the absorption can only be detected indirectly, 
whereas the reflected or transmitted light can be detected directly. Therefore, 
several methods have been developed to correct non-linearity of absorption, 
scattering effects and transfer of chemometric calibration models. 


2.2.1 Absorbance units 


In chemometrics, light which is not detected by the sensor (1 − r) is referred to
as absorption. Often this signal is also expressed in


a := log(1 − r)    (2.1)


absorbance units (AU), where r describes the reflected signal detected and
discretized by the sensor. Logarithms are used because of the exponential
relationship between absorption and substance concentration given by the
Beer-Lambert law.
2.2.2 Scatter correction


In chemometrics, no distinction is made between the physical processes of elastic
and inelastic light scattering. Only the terms absorption or reflection/transmission
are used. Nevertheless, it is known that elastic scattering effects described by
Mie or Rayleigh theory have an impact on the spectrum, so a scatter correction
is necessary. Therefore, a Multiplicative Scatter Correction (MSC) or a
Standard Normal Variate (SNV) transformation is often applied as a pre-processing
method [10]. Another approach is to take the derivative of the spectrum, which is
often combined with smoothing operations [7].
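

As an illustration of the two scatter corrections named above, the following sketch
implements SNV and MSC in their commonly used textbook form. It is a minimal example
and not the implementation used by the authors. The spectra are hypothetical values,
and the scatter distortion is modelled as a simple gain and offset.

```python
from statistics import mean, stdev


def snv(spectrum):
    """Standard Normal Variate: centre and scale a spectrum by its own statistics."""
    m = mean(spectrum)
    s = stdev(spectrum)  # sample standard deviation (n - 1 in the denominator)
    return [(x - m) / s for x in spectrum]


def msc(spectrum, reference):
    """Multiplicative Scatter Correction against a reference spectrum.

    Fits spectrum = offset + slope * reference by least squares and
    returns (spectrum - offset) / slope.
    """
    ref_mean = mean(reference)
    spec_mean = mean(spectrum)
    sxx = sum((r - ref_mean) ** 2 for r in reference)
    sxy = sum((r - ref_mean) * (x - spec_mean) for r, x in zip(reference, spectrum))
    slope = sxy / sxx
    offset = spec_mean - slope * ref_mean
    return [(x - offset) / slope for x in spectrum]


# Hypothetical reference spectrum and a copy distorted by scatter (gain 1.5, offset 0.2).
reference = [0.10, 0.12, 0.20, 0.35, 0.30, 0.18, 0.11]
distorted = [1.5 * r + 0.2 for r in reference]

corrected = msc(distorted, reference)
print([round(x, 3) for x in corrected])       # recovers the reference spectrum
print([round(x, 3) for x in snv(distorted)])  # identical to snv(reference)
```

In practice the mean spectrum of a calibration set is usually taken as the MSC
reference, whereas SNV needs no reference and works on each spectrum separately.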


2.2.3 Instrument transfer 


An optical spectrometer records the spectrum of the light and converts it
into a digital measurement signal. Depending on the instrument used,
spectrometers differ in their spectral range as well as in their optical and digital
sampling resolution. Even devices of the same type and manufacturer often
differ due to mechanical tolerances. For this reason, various methods for the transfer
of calibration models have been developed [6, 3].


3 The spectral signature of a sample 


The following section describes the signal components of the optical spectrum
in the near and short-wave infrared (780 nm to 2500 nm). In the optical spectrum,
the physical effects of scattering and absorption are superimposed. Nevertheless,
the spectrum can be evaluated by chemometric calibration models or machine
learning methods. The amount of training data required for this can be reduced
by making specific prior assumptions. With the following model, some physically
motivated assumptions about properties (baseline, absorption bands) of the
spectral signature (see fig. 3.1) can be formulated.


This information model is used in chapter 5.1 to define characteristics. Finally, in 
chapter 5.2 a characterization of spectral sensor systems based on the detection 
of these features is proposed. 
Figure 3.1: The spectral signature of an object is generated from the superposition of elastic and
inelastic scattering (absorption). The absorbing molecule groups create Gaussian-shaped absorption
bands. Their size allows quantitative analysis of the ingredients. Due to the wavelength-dependent
elastic scattering processes (Mie and Rayleigh scattering), a smooth baseline is created.


3.1 A stochastic model to describe the spectral signature of 
a sample 


The interaction between light and the sample can be described by a model
of stochastic processes [5]. The spectral signature of the sample is
given by the probability density functions of reflection $r_\theta(\lambda) \in [0, 1]$
and transmission $t_\theta(\lambda) \in [0, 1]$, depending on the wavelength $\lambda$,
the angle $\phi \in [-\pi/2, \pi/2]$ of the incident light from the light source and the
angle $\theta \in [-\pi/2, \pi/2]$ of the reflected or transmitted light. Both angles are
measured relative to the surface normal. A graph is used to describe the light and
matter interaction (see fig. 3.2): The light source radiates photons with the
probability $N_\phi(\lambda) \in [0, 1]$ within the time period $T$ onto the sample.
Multiple elastic scattering processes $s_{i,j}(\lambda) \in [0, 1]$
can occur within and between different layers $i, j \in \mathbb{N}$ of the sample. For
homogeneous materials without packaging, the number of layers can be reduced
to one. From the surface, photons can be emitted at different angles $\theta$ as
observable reflection and transmission. In addition, photons can be absorbed in
each layer $n \in \mathbb{N}$ with probability $a_n(\lambda) \in [0, 1]$. For better
readability, the wavelength dependence is not stated explicitly at every point. The
angles are usually unknown and cannot be measured. The angle-dependent scattering
effects mainly appear when using very different samples or when comparing different
measuring instruments. Therefore, these quantities are given as an index.


[Diagram: light source $N_\phi$, sample layers with scattering processes $s_{1,1}$, $s_{2,2}$, ...,
$s_{n-1,n}$, $s_{n,n}$, and the outputs reflection, absorption and transmission]


Figure 3.2: The spectral signature of the transmitted light $t_\theta$ or reflected light $r_\theta$ of a
sample is formed by multiple scattering processes $s_{i,j}$ and absorption processes $a_n$ within the
sample. The scattering or absorption can differ between layers such as packaging, peel and pulp.
In addition, the spectral response of the light source is described as a probability density
function $N_\phi$. The angles of incidence of the light source are denoted by $\phi$. The angles of
emission of transmission and reflection are denoted by $\theta$.
The probability $a_i \in [0, 1]$ of a photon ending in the state of absorption depends
on the concentrations $c_i \in [0, 1]$ of absorbing molecules. The chemical
information is therefore not directly observable. However, energy conservation can
be assumed, such that


$\sum_n a_n + \int r_\theta \, d\theta + \int t_\theta \, d\theta = N_\phi$    (3.1)


is valid. A general case for multiple light sources or directions from diffuse
illumination can be created by adding a sum or integral over $\phi$.


Using eq. 3.1, some fundamental cases can be named:


• Specular reflection $\int r_\theta \, d\theta = N_\phi$: In the case of specular reflection, all
light is reflected. There is no transmission or absorption of the light.


• Total absorption $\sum_n a_n = N_\phi$: There is no measurement signal in the case
of total absorption.


• Diffuse reflection $\int t_\theta \, d\theta = 0$: This assumption is valid for samples of
infinite thickness. The reflected light is given by
$\int r_\theta \, d\theta = N_\phi - \sum_n a_n$.


• Diffuse transmission $\int r_\theta \, d\theta = 0$: This assumption is valid for liquid
samples. The transmitted light is given by $\int t_\theta \, d\theta = N_\phi - \sum_n a_n$.


To minimize the angular dependency of the reflected signal, a diffuse illumination 
is usually used. 
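

The bookkeeping behind the energy balance can be illustrated with a few numbers.
The sketch below treats the integrated quantities as plain numbers and solves the
balance N = absorbed + reflected + transmitted for the reflected part. The function
name and the layer values are hypothetical.

```python
def reflected_light(n_source, absorptions, transmitted=0.0):
    """Integrated reflection that remains after absorption and transmission.

    Rearranged energy balance: reflected = N - sum(a_n) - transmitted.
    """
    return n_source - sum(absorptions) - transmitted


# Diffuse reflection of a thick sample: nothing is transmitted.
# Hypothetical absorption in two layers, e.g. peel and pulp.
print(round(reflected_light(1.0, [0.2, 0.1]), 3))

# Total absorption: no signal is left for the sensor.
print(round(reflected_light(1.0, [0.6, 0.4]), 3))
```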


3.2 Absorption 


Absorption bands in the near and short-wave infrared originate from molecule
groups with polar hydrogen bonds (OH, CH, NH, SH, COOH, ...), which absorb
the light. An absorption process becomes possible when the wavelength (energy)
of the light matches the energy levels of the polar hydrogen bond within these
functional molecule groups.
The absorption


$a(\lambda) = \sum_i \frac{c_i}{\sqrt{2\pi}\,\rho_i} \exp\left(-\frac{(\lambda - \lambda_i)^2}{2\rho_i^2}\right)$    (3.2)


is a sum over all absorbing molecule groups [8]. The energy levels $\lambda_i$ of
the molecule groups mentioned overlap and also shift non-linearly
depending on the sample composition. The width of the absorption band is given
by $\rho_i$ and the concentration follows the Beer-Lambert law


$c_i = 1 - e^{-\alpha_i}$    (3.3)


with an absorption coefficient $\alpha_i \in \mathbb{R}$ depending on the dipole moment of the
molecule. As described in the model (see fig. 3.2), the absorption can also
change in different layers, e.g. apple peel and fruit flesh. Therefore, different
absorption functions $a_i(\lambda)$ must be used.


However, the analysis of spectral data results in an ill-posed inverse problem:
based on a detected absorption band, it is usually not possible to know which
molecular group is the origin of the absorption.


3.3 Diffuse reflection and transmission 


The scattering parameters $s_{i,j}(\lambda)$ of a sample vary depending on the
microstructure (surface roughness as well as particle and molecule size). Using this
scattering parameter, the reflected (transmitted) spectral signature


$r_\theta(\lambda) = \left(1 - \sum_i a_i(\lambda)\right) s_{1,\theta}(\lambda)$    (3.4)


results from light which is not absorbed and is scattered out of the top (bottom)
layer of the sample. The scattering parameter can be explained by Mie and
Rayleigh theory. Because the required parameters such as illumination angle
and measuring distance are not known in many cases, the scattering parameter $s$
is assumed to be a continuous and smooth function. Furthermore, the scattering
parameter is assumed to be locally constant in the region of an absorption band.
4 Model for measurement systems in optical spectroscopy


The individual steps of signal generation are shown in fig. 4.1. After
the explanation of the spectral signature in the previous section, the spectroscopic
measurement system is now in focus. An optical system is used to project the
spectrum onto a detector. The detector converts the optical signal into a digital
measurement signal.


[Diagram: signal chain from the sample through the optical system to the detector]


Figure 4.1: The $N_\phi(\lambda)$ photons emitted by a light source are reflected after interaction with
the sample S. The angular and wavelength dependent reflectance of the sample forms the spectral
signature $r_\theta(\lambda)$. An optical system (e.g. poly- or monochromator) is used to project the
transformed reflectance spectrum $\tilde{r}_\theta(\lambda)$ onto a detector D.


4.1 Optical system 


The optical resolution is diffraction-limited in the case of grating spectrometers
and can be calculated from the known grating, slit and distances. For this, the
Rayleigh criterion is used; the resolution limit $\Delta\lambda$ describes the radius of the
Airy disk. This profile can also be well approximated by a Gaussian
curve. Therefore, a spectral band $i \in \mathbb{N}$ of the optical system can be approximated
by a point spread function (PSF) based on a Gaussian function


$h_i(\lambda) = \frac{1}{\sqrt{2\pi}\,\rho_{\mathrm{PSF}}} \exp\left(-\frac{(\lambda - \lambda_i)^2}{2\rho_{\mathrm{PSF}}^2}\right)$,    (4.1)


which is mathematically easy to handle. In the case $\Delta\lambda = 0$ of an ideal optical
system, the transfer function $h_i(\lambda) = \delta(\lambda - \lambda_i)$ is obtained. In data sheets, the
resolution of the optical system is usually specified by the FWHM (full width
at half maximum), which is related to the Gaussian width by
$\rho_{\mathrm{PSF}} = \mathrm{FWHM} / (2\sqrt{2\ln 2})$.
The reflection signal


$\tilde{r}_\theta(\lambda) = h(\lambda) * r_\theta(\lambda) =: \mu_p(\lambda)$,    (4.2)


which is projected onto the detector, also defines the photon current that is used
in the next chapter.


Another property of the point spread function is the smoothing of the reflection
signal. This leads to an attenuation of the absorption bands (eq. 3.2). Using eq.
4.2 and assuming Gaussian functions in eq. 4.1 and eq. 3.2, a new attenuated
parameter for the concentration


$\tilde{c}_i = c_i \frac{\rho_i}{\sqrt{\rho_{\mathrm{PSF}}^2 + \rho_i^2}}$    (4.3)


can be specified.
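

The attenuation can be checked numerically. When a Gaussian band of width ρ_i is
blurred by a Gaussian PSF of width ρ_PSF, standard Gaussian convolution reduces the
peak height by the factor ρ_i / sqrt(ρ_PSF² + ρ_i²). The sketch below assumes that both
functions are normalized Gaussians, converts a data-sheet FWHM into a Gaussian
width and compares the closed-form factor with a direct numerical convolution at the
band centre. All function names and the example FWHM are illustrative.

```python
import math


def sigma_from_fwhm(fwhm):
    """Gaussian width from the full width at half maximum."""
    return fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))


def gaussian(x, sigma):
    """Normalized Gaussian centred at zero."""
    return math.exp(-x * x / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)


def attenuation(rho_band, rho_psf):
    """Closed-form peak attenuation of a Gaussian band by a Gaussian PSF."""
    return rho_band / math.hypot(rho_band, rho_psf)


def blurred_peak(rho_band, rho_psf, step=1.0, half_width=400.0):
    """Value of (PSF * band) at the band centre, computed as a Riemann sum."""
    n = int(half_width / step)
    xs = [i * step for i in range(-n, n + 1)]
    return sum(gaussian(-x, rho_psf) * gaussian(x, rho_band) for x in xs) * step


rho_band = 50.0  # nm, width of the absorption band
rho_psf = 20.0   # nm, width of the PSF

ratio = blurred_peak(rho_band, rho_psf) / gaussian(0.0, rho_band)
factor = attenuation(rho_band, rho_psf)
print(round(ratio, 4), round(factor, 4))

# A hypothetical data-sheet FWHM of 12 nm expressed as a Gaussian width in nm.
print(round(sigma_from_fwhm(12.0), 2))
```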


4.2 Detector model based on EMVA1288 


The EMVA1288 standard contains a comprehensive description of the various 
signal contributions in semiconductor detectors and the digitization that follows. 
However, the EMVA1288 standard is used to characterize camera sensors 
without optics and refers to illumination with monochromatic light. 


The noise (variance) of the grey values of a spectral band


$\sigma_i^2 = K^2 \sigma_d^2 + \sigma_q^2 + K (\mu_i - \mu_{i,\mathrm{dark}})$    (4.4)


results from the amplified dark noise $\sigma_d$, the quantization noise
$\sigma_q^2 = 1/12\,\mathrm{DN}^2$ and the fluctuations of the photon stream. The latter are
subject to a Poisson distribution and are signal dependent.
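

The three noise contributions can be put into a small function following the
EMVA1288 linear camera model: amplified dark noise, quantization noise and
signal-dependent shot noise. The parameter values below are hypothetical and only
illustrate the relative size of the terms.

```python
import math


def grey_value_variance(gain, dark_noise_e, mean_dn, mean_dark_dn, quantization_var=1.0 / 12.0):
    """Variance of the grey value in DN^2 according to the linear camera model.

    gain             system gain K in DN per electron
    dark_noise_e     standard deviation of the dark noise in electrons
    mean_dn          mean grey value of the band in DN
    mean_dark_dn     mean dark grey value of the band in DN
    quantization_var quantization noise variance, 1/12 DN^2
    """
    dark = gain ** 2 * dark_noise_e ** 2
    shot = gain * (mean_dn - mean_dark_dn)
    return dark + quantization_var + shot


# Hypothetical sensor: K = 0.5 DN/e-, 10 e- dark noise, 1000 DN signal, 100 DN dark level.
variance = grey_value_variance(0.5, 10.0, 1000.0, 100.0)
snr = (1000.0 - 100.0) / math.sqrt(variance)
print(round(variance, 2), round(snr, 1))
```

In this example the shot-noise term dominates the variance, which is typical for
well-illuminated bands. For weakly illuminated bands the dark noise takes over.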


The signal


$\mu_i = \int \mathrm{rect}(\lambda)\, \tilde{r}_\theta(\lambda)\, \eta(\lambda)\, K \, d\lambda + K \mu_{\mathrm{dark}}$    (4.5)


of a spectral band results from the signal sampled over the range $\Delta\lambda$.
Figure 4.2: EMVA1288 sensor system: The photon current $\mu_p$ is subject to a fluctuation $\sigma_p$. In
the semiconductor, the photons generate electrical charge carriers $\mu_e$ with the quantum efficiency
$\eta$. Thermal excitation produces an additional dark current $\mu_d$. The charge carriers are amplified
in the analog domain by the factor $K$ and then converted into a quantized measurement signal
$\mu_i$ by an analog-to-digital converter (ADC).


4.3 White and black balance 


A white and black balance is necessary because of the spectral characteristics of
the light source $N_\phi(\lambda)$, the quantum efficiency $\eta(\lambda)$, as well as the additional
dark current of the detector $\mu_{i,\mathrm{dark}}$. The signal


$g_i = \frac{\mu_i - \mu_{i,\mathrm{dark}}}{\mu_{i,\mathrm{ref}} - \mu_{i,\mathrm{dark}}}$    (4.6)


can be calculated based on a reference spectrum of a sample with known
reflectance and the dark signal. This wavelength-dependent scaling leads to an
amplification


$\sigma_{g,i} = \frac{\sigma_i}{\mu_{i,\mathrm{ref}} - \mu_{i,\mathrm{dark}}}$    (4.7)


of the noise of spectral bands. In many cases, a significant increase in noise can
be observed at the borders of the spectrum.
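

The effect of the balance on the noise can be seen in a small numerical example.
The sketch assumes the usual form of the white and black balance, in which the
dark-corrected signal is divided by the dark-corrected reference signal, and scales
the noise by the same denominator. The grey values are hypothetical. The same raw
noise is used for both bands to isolate the effect of the scaling.

```python
def balanced_signal(mu, mu_dark, mu_ref):
    """Dark-corrected signal relative to the dark-corrected white reference."""
    return (mu - mu_dark) / (mu_ref - mu_dark)


def balanced_noise(sigma, mu_dark, mu_ref):
    """Noise of the balanced signal: raw noise divided by the reference span."""
    return sigma / (mu_ref - mu_dark)


# Hypothetical grey values in DN for a band in the centre of the spectrum,
# where the reference is bright, and for a band at the border, where it is dim.
centre = {"mu": 2100.0, "mu_dark": 100.0, "mu_ref": 4100.0, "sigma": 20.0}
border = {"mu": 300.0, "mu_dark": 100.0, "mu_ref": 500.0, "sigma": 20.0}

for name, band in (("centre", centre), ("border", border)):
    g = balanced_signal(band["mu"], band["mu_dark"], band["mu_ref"])
    sigma_g = balanced_noise(band["sigma"], band["mu_dark"], band["mu_ref"])
    print(name, round(g, 3), round(sigma_g, 4))
```

Both bands yield the same balanced signal of 0.5, but the noise of the border band
is ten times larger because its reference span is ten times smaller.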
5 Sensor characterization


In order to characterize a spectrometer system (see fig. 4.1), the already introduced
properties of the light source $N_\phi(\lambda)$, the optical system $h(\lambda)$ and the detector D
are combined in the following. The aim is an estimation of the measurement
uncertainty for the detection of absorption bands.


5.1 Features in optical spectroscopy 


The intensity of an absorption band can be used to quantify sample properties,
as introduced above by the Beer-Lambert law. Therefore, the absorption bands
are defined as features


$m_j = \int a_j(\lambda)\, s_{1,\theta}(\lambda_j)\, d\lambda = c_j \cdot s_{1,\theta}(\lambda_j)$,    (5.1)


where the scattering parameter $s_{1,\theta}(\lambda_j)$ is assumed to be locally constant. These
features are attenuated by the optical system and are recorded with noise. Using
the relation $m_j \propto c_j$ together with eq. 4.3 and eq. 4.7 leads to a standard deviation


$\sigma_{m_j} \propto \frac{\sqrt{\rho_{\mathrm{PSF}}^2 + \rho_j^2}}{\rho_j} \cdot \frac{\sigma_g}{\sqrt{n}}$    (5.2)


in the detection of spectral absorption bands. The optical attenuation of the
absorption bands in the first term has an amplifying effect. Depending on the
digital resolution, the noise influence is reduced by acquisition with $n$ channels.


5.2 Example for a new benchmark calculation in optical spec- 
troscopy 


From laboratory tests it is known that for recording moisture the feature $m$ at
$\lambda_m = 1350$ nm with a width of $\rho_m = 50$ nm must be used. Two spectrometer
systems with different characteristics are available: one system with low noise
(sensor A) and one with high resolution (sensor B).
Table 5.1: Sensor comparison: Sensor A has a lower optical resolution; the spectral range of
900 nm to 1650 nm is recorded with 128 bands. Sensor B has a high optical resolution; the spectral
range of 900 nm to 1700 nm is recorded with 255 bands. Due to the smaller amount of light per
spectral band, the noise of sensor B is increased compared to sensor A.


                                  Sensor A | Sensor B
$\rho_{\mathrm{PSF}}$ in nm             20       |  5
Bands/nm                          0.18     |  0.32
$\sigma_g$ @ $\lambda_m$ in nm          0.01     |  0.02


By multiplying the digital resolution (bands/nm) by the width of the absorption
band $\rho_m$, the number $n$ of spectral bands involved in the sensor can be
determined. This results in an estimated standard deviation of the feature $m$
of $\sigma_{m,A} = 0.0039$ for sensor A and $\sigma_{m,B} = 0.005$ for sensor B. For a general
comparison of the two sensors, the trend of $\sigma_m$ over $\rho_m$ is shown in figure 5.1.
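

The benchmark calculation can be sketched in a few lines. The example assumes
that the uncertainty is the product of the optical term sqrt(ρ_PSF² + ρ_m²) / ρ_m and the
noise term σ_g / sqrt(n) with a proportionality constant of one, and that n is the product
of the digital resolution and the band width. Both are assumptions made for this
sketch. The sensor values are taken from Table 5.1.

```python
import math


def feature_sigma(rho_psf_nm, bands_per_nm, sigma_g, rho_band_nm):
    """Estimated standard deviation of an absorption-band feature.

    Assumes a proportionality constant of one and n = bands_per_nm * rho_band_nm.
    """
    n = bands_per_nm * rho_band_nm
    optical_term = math.hypot(rho_psf_nm, rho_band_nm) / rho_band_nm
    return optical_term * sigma_g / math.sqrt(n)


sensors = {
    "A": {"rho_psf_nm": 20.0, "bands_per_nm": 0.18, "sigma_g": 0.01},
    "B": {"rho_psf_nm": 5.0, "bands_per_nm": 0.32, "sigma_g": 0.02},
}

# Moisture feature with a width of 50 nm.
sigma_a = feature_sigma(rho_band_nm=50.0, **sensors["A"])
sigma_b = feature_sigma(rho_band_nm=50.0, **sensors["B"])
print(round(sigma_a, 4), round(sigma_b, 4))

# Trend over the band width, analogous to the comparison plot.
for rho in (10.0, 20.0, 50.0, 100.0):
    row = [round(feature_sigma(rho_band_nm=rho, **s), 4) for s in sensors.values()]
    print(rho, row)
```

With these inputs the sketch yields roughly 0.0036 for sensor A and 0.0050 for
sensor B. The value for sensor B agrees with the figure quoted above, while the value
for sensor A is slightly below the quoted 0.0039, so the original calculation may use a
different constant or a different rounding of n.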


[Plot: $\sigma_m$ (vertical axis, ticks from 0.004 to 0.016) over $\rho_m$ for sensor A and sensor B]


Figure 5.1: With increasing width of the absorption band to be detected, the number
of spectral channels involved in the sensor increases, while the influence of the optical resolution
decreases in proportion. For a defined absorption band, the measurement uncertainty of the sensors
can therefore be compared in the graph.
6 Summary


The signal generation in optical spectroscopy was described in terms of stochastic 
processes starting from the light source up to interpretable features. The focus 
was on signals in the near and short wave infrared spectrum. For quantitative 
statements on sample properties, the characteristics of absorption bands were 
justified and their signal portion was presented in reflection and transmission 
measurements. 


Based on the absorption bands as quantitative features in optical spectra an 
estimation of the stochastic measurement uncertainty was formulated. For 
this purpose, the optical resolution was combined with the detector properties 
according to EMVA1288. As a result, spectroscopic measurement systems can be
characterized by the expected stochastic measurement uncertainty. The definition 
of task-specific requirements for the resolution of certain absorption bands 
enables a benchmark for spectroscopic measurement systems as a whole. The 
approach can be generally used for hyperspectral cameras including illumination 
and optics or novel compact spectrometers from the consumer sector. 
Abstract


Diffractive optical elements (DOEs) can produce high-numerical-aperture (NA)
spots over a large field. They can be combined with a low-NA objective to
measure a large area with high resolution. This work presents experiments in which
DOEs are used in a reflection confocal microscope to resolve small structures beyond
the capability of the objective. Both qualitative and quantitative results
show an enhancement in lateral and axial resolution through the application of the
DOEs, which agrees with the imaging theory of confocal microscopes.


1 Introduction 


Confocal microscopy has been widely used as a standard measurement method
in many fields for years [11]. The resolution of a confocal microscope mainly
depends on the numerical aperture (NA) of the objective. Objectives with
higher NAs can produce smaller illumination spots, and thus they can image
the sample with high resolution. However, high-NA objectives are very limited
in the field of view and only a small portion of the sample can be measured
at a time. Besides, high-NA objectives are also expensive due to their complicated
optical design and manufacture.


Figure 1.1: Design procedure of the DOE which generates a spot array with overlapping apertures [4].


In order to solve the problems, diffractive optical elements are proposed to be 
used in confocal microscopes pel. They can produce spots with overlapping 
apertures. Unlike with a microlens array, the NA of a single produced spot is no
longer limited by the unit area above it. On the contrary, the surrounding area also
contributes to the spot. The proposed DOEs can therefore produce a dense spot array
with a high NA. The design procedure of the DOEs is illustrated in Fig. 1.1.
First, a required target spot field distribution Uspot is defined. By simulation, the 
target field propagates back and forms a spherical-wave-like field distribution 
ur. Rayleigh-Sommerfeld integral BA is used as the simulation method 
for the diffraction field propagation, which is shown as 


, ew tklr—r'| y a 
U = N Uspot (T Dei dy A (1.1) 


where $\Sigma$ denotes the surface on the boundary, i.e. the plane on which $u_{\mathrm{spot}}$ lies
and the semi-infinite sphere behind it, $r = (x, y, z)$ is the coordinate of $u_r$,
$r' = (x', y', z')$ is the coordinate on $\Sigma$, and $k$ is the wave number.


By utilizing the idea of overlapping apertures, $u_r$ is duplicated and overlapped
with a designed pitch to form the overlapping field $u_p$. Then the phase of $u_p$ is
extracted and binarized with a binarization factor $B$ [3] as follows,


op(a,y) = mod (=E 2) T (1.2) 


T 
Finally, the binarized field propagates back by simulation to validate the design
result. In this way, a dense spot array with a high NA can be generated under
plane-wave illumination. Spots with an NA of up to 0.77 have
been demonstrated [5]. However, when used in a confocal setup like the one in
Fig. 1.2, such DOEs introduce significant disturbances in the imaging path. To
realize the setup, the DOEs need to be modified to increase the zero-order
diffraction, which allows the generated spots to be imaged through the DOE itself.
This is done by adding a plane-wave component to the overlapped field $u_p$,


$\tilde{u}_p = u_p + W$,    (1.3)


where $W$ is a constant which is optimized iteratively to achieve the best
signal-to-noise ratio of the spots in the image. In this way, the disturbance added by
the DOE is significantly reduced when the spots are imaged through it, which
has already been demonstrated experimentally [5]. The resulting DOE, which
is called See-through DOE, can thus be used in the confocal microscope in
Fig. 1.2. Both lateral and axial resolution can be enhanced, as shown by
theory and experiments in the following chapters.


[Diagram: camera, beam splitter, objective]


Figure 1.2: Setup of a confocal microscope using the DOE. A laser is collimated by the objective to 
illuminate the DOE. The produced spots are again imaged by the objective onto the camera sensor. 
2 Resolution enhancement by the DOE


The confocal microscope setup in Fig. 1.2 uses a low-NA objective to increase
the field of view and a see-through DOE to project high-NA spots. The objective
also acts as a collimator to produce plane-wave illumination for the DOE. The setup
is suitable for measuring opaque surfaces, which transmission microscopes cannot
measure, as well as for fluorescence microscopy. When an opaque surface is
measured, lateral resolution can be significantly increased by the DOE while
axial resolution can only be slightly increased [4]. When fluorescent samples
in a transparent medium are measured, both lateral and axial resolution can be
increased to a level comparable to that of a high-NA objective.


2.1 Theory of scanning microscopy 


The image formation of a scanning microscope can be described by the following
equation:


$U(x_2, y_2; x_s, y_s)$    (2.1)
$= \iint_{-\infty}^{\infty} h_1(x_0, y_0)\, t(x_0 - x_s, y_0 - y_s)\, h_2\left(\frac{x_2}{M} - x_0, \frac{y_2}{M} - y_0\right) dx_0\, dy_0$,    (2.2)


where $(x_0, y_0)$ is the object coordinate, $(x_2, y_2)$ is the image coordinate, $(x_s, y_s)$
is the scanning position, $h_1$ is the illumination point spread function
(PSF), $h_2$ is the imaging PSF, and $t(x_0, y_0)$ is the object transmissivity or
reflectivity. For a confocal microscope, a point detector is used at $x_2 = y_2 = 0$
and the intensity at every scanning position can be expressed by


$I(x_s, y_s) = \left| \iint_{-\infty}^{\infty} h_1(x_0, y_0)\, t(x_0 - x_s, y_0 - y_s)\, h_2(-x_0, -y_0)\, dx_0\, dy_0 \right|^2$,    (2.3)


which can be simplified to the following equation because the PSF is even,


$I(x_s, y_s) = |h_1 h_2 * t|^2$,    (2.4)
Figure 2.1: Simulation of a single point with different illumination and imaging PSFs. The dashed
line represents a wide-field microscope configuration with a single PSF.


The equivalent PSF for the confocal image thus becomes $h_1 h_2$. In this case, if
the illumination $h_1$ is high-NA and the imaging $h_2$ is low-NA, the combined
PSF will still be dominated by the high-NA illumination.


Figure 2.1 shows the simulation results of the lateral intensity profile when a
single point is imaged with illumination and imaging of different NAs. It
shows that when the illumination has a high NA, the combined confocal PSF is
nearly independent of the low-NA imaging objective and is slightly smaller than the
wide-field high-NA curve. Thus the lateral resolution can be increased by a
setup like the one in Fig. 1.2. This is also very similar to the principle of
super-resolution microscopy like STED [2] or PALM [1]. Similarly, the axial resolution
can also be increased for a point-like object in fluorescence microscopy, which has
been explained in [4].
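

The dominance of the high-NA illumination can be illustrated with a simplified
calculation. The sketch below approximates both intensity PSFs by Gaussians whose
lateral FWHM is 0.51 λ / NA, a common approximation that is not taken from this
report, and uses the fact that the product of two Gaussians is again a Gaussian. The
wavelength and the illumination NA are hypothetical example values.

```python
import math


def lateral_fwhm_um(wavelength_um, na):
    """Approximate lateral FWHM of a diffraction-limited spot (0.51 * lambda / NA)."""
    return 0.51 * wavelength_um / na


def product_fwhm_um(fwhm_a, fwhm_b):
    """FWHM of the product of two Gaussians with the given FWHMs."""
    return 1.0 / math.sqrt(1.0 / fwhm_a ** 2 + 1.0 / fwhm_b ** 2)


wavelength_um = 0.63  # hypothetical wavelength
illumination = lateral_fwhm_um(wavelength_um, 0.7)  # high-NA spot from the DOE
imaging = lateral_fwhm_um(wavelength_um, 0.15)      # low-NA objective
confocal = product_fwhm_um(illumination, imaging)

print("illumination FWHM:", round(illumination, 3), "um")
print("imaging FWHM:     ", round(imaging, 3), "um")
print("confocal FWHM:    ", round(confocal, 3), "um")
```

The combined width stays close to, and slightly below, the width of the high-NA
illumination spot, even though the imaging PSF is several times wider.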
(a) DOE confocal scanning image. (b) Wide-field image. 


Figure 2.2: Images of a resolution target taken by an objective with NA=0.15. 


2.2 Experiments of the DOE in confocal microscopy 


Experiments were also carried out to test the resolution enhancement by the DOE in a
confocal microscope as shown in Fig. 1.2. A standard positive USAF resolution
target from Thorlabs with a maximum resolution of 228 line pairs per millimeter
is used as the test object. The target is scanned laterally in a zig-zag pattern
and a confocal scanning image is obtained.
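

The quoted maximum resolution can be related to the elements of the target. For
the 1951 USAF chart, the resolution of an element is 2^(group + (element - 1) / 6) line
pairs per millimeter. The sketch below assumes that the 228 line pairs per millimeter
correspond to group 7, element 6, which is an inference from this formula and not
stated in the text.

```python
def usaf_line_pairs_per_mm(group, element):
    """Resolution of a 1951 USAF target element in line pairs per millimeter."""
    return 2.0 ** (group + (element - 1) / 6.0)


def line_width_um(line_pairs_per_mm):
    """Width of a single line (half a line pair) in micrometers."""
    return 1000.0 / (2.0 * line_pairs_per_mm)


finest = usaf_line_pairs_per_mm(7, 6)
print(round(finest, 1), "lp/mm")
print(round(line_width_um(finest), 2), "um line width")
```

The finest element then has lines of about 2.2 µm width.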


Fig. 2.2 shows a comparison of a DOE confocal scanning image and a wide-field
image, both taken by an objective with NA = 0.15. The DOE clearly
increases the lateral resolution and even the finest patterns can be clearly resolved.
The zig-zag-like artifacts in the image are caused by insufficient accuracy of the
xy stages, which leads to misalignment in the confocal image reconstruction.
In contrast, the wide-field image is totally blurred because the numerical aperture
of the objective is not high enough.


Furthermore, the images shown in Fig. 2.3 were taken by an objective with an
even smaller NA of 0.07. There is still a very obvious resolution enhancement.
However, the signal-to-noise ratio and the resolution of the confocal scanning
image are slightly reduced compared to the image taken by an objective
with an NA of 0.15.


There could be several reasons for this. First, the diffraction efficiency of a
binary DOE is limited. There is unavoidable -1 order diffraction, which is stray
light and is collected by the objective, forming a noisy background. When
the NA of the objective is lower, the signal is weaker, so the signal-to-noise ratio
is reduced. This can be mitigated by using a multi-level DOE to
increase the diffraction efficiency. Moreover, the Andor Zyla 5.5 camera we
used has a large pixel size of 6.5 µm, and we use a 0.63x tube lens and a 2.5x
objective with NA = 0.07. The total magnification is 1.575, which is rather small.
This leads to a relatively large equivalent pinhole size of roughly 3.7 µm, which
cannot effectively block the stray light. Using a camera with a smaller pixel
size or a tube lens with a larger magnification can mitigate the problem.


After the lateral measurement, the axial resolution is also tested with the
fluorescence microscope setup shown in Fig. 2.4. Sphero Rainbow fluorescent
particles are used as samples. The excitation wavelength is 630 nm and the
emission wavelength ranges from 672 nm to 712 nm. The beads have diameters of
3.0-3.4 µm. In the experiment, only one fluorescent bead is brought into focus. The
bead is moved vertically to measure the intensity response. Objectives
with NAs of 0.15 and 0.07 are both used for testing. The produced spots have an axial
full width at half maximum (FWHM) of 19.5 µm and 17.2 µm respectively,
which corresponds to an NA of roughly 0.25; the difference is due to the different
collimation quality of the objectives. The confocal signals show axial FWHMs of
25.9 µm and 26.3 µm respectively. The results show that the axial resolution is almost
independent of the imaging objective, as predicted by the theory. Still,
the confocal axial peaks are a bit wider than the illumination spot. A possible
reason is that the pixel acting as a pinhole is large and the diameter of the bead is
not negligible.


(a) DOE confocal scanning image. (b) Wide-field image. 


Figure 2.3: Images of a resolution target taken by an objective with NA=0.07. 


Figure 2.5: Axial intensity response of an in-focus fluorescent bead. 


3 Conclusion 


Traditional confocal microscopy relies on high-NA objectives to achieve high 
resolution. However, high-NA lenses have a very limited field of view. The 
See-through DOE can be used with a low-NA objective in a reflection confocal 
microscope to provide a large field of view. The DOE can produce high-NA 
spots and maintain the resolution of a high-NA objective in such a setup. 
For surface measurements, a 2D scan was performed and the enhancement of 
resolution is clearly demonstrated. The See-through DOE was also 
successfully used in fluorescence microscopy. Fluorescence signals of the 
beads were observed and the axial response was tested. The axial FWHMs are 
independent of the NA of the objective, which agrees with the theory. 


In the future, the measurement uncertainty will be tested for surface 
measurement. For fluorescence measurement, a 3D scan of living cells will be 
carried out. New experiments are planned to further demonstrate the capability 
of the DOEs to increase the measurement resolution. 


Abstract 


The application of video surveillance systems in public areas to ensure public 
security is becoming increasingly important. A major task when evaluating the 
arising amount of video data is to find the occurrences of a person-of-interest 
on the basis of a testimony. For the comparison of a person’s description with 
persons in the video data, the attributes of all persons must be recognized 
automatically. However, typical approaches to pedestrian attribute recognition 
simply predict all attributes for a person, regardless of the visibility of relevant 
attributes. To address this problem, the concept of realistic predictors is used 
in this work to determine and improve the reliability of pedestrian attribute 
recognition. 


1 Introduction 


Nowadays, more and more video surveillance systems are used to ensure public 
security. Due to the large amount of image and video footage that is recorded by 
such systems, manual evaluation is hardly possible, which is why intelligent and 
automatic analysis systems are required. One of the most important evaluation 
tasks that can be automatically solved by applying convolutional neural networks 
(CNNs) is person re-identification, which aims to find all occurrences of a 
person-of-interest in the data. Typically, such a search is performed based on a 
cropped image of the person the system operators are interested in. But since it 
is not possible to cover all areas with CCTV cameras, one cannot be sure that 
a query image of the person-of-interest is always available. In such 
cases, descriptions of the semantic attributes are the only clues on which the 
person search can be based. The query attributes can be easily and directly 
extracted from witness descriptions. In order to find all persons corresponding 
to the obtained attributes, the semantic attributes of the persons present in the 
surveillance material must be recognized. 


(a) Detections (b) Viewpoints (c) Appearance (d) Occlusions 


Figure 1.1: Different challenges in recognizing pedestrian attributes. Poor detections and occlusions 
can lead to only partially visible persons in images. Moreover, some attributes like backpack may 
not be visible from all points of view and attributes such as handbags may appear as many different 
types. 


Pedestrian attribute recognition in an uncooperative, real-world scenario 
suffers from many different challenges. Some of the most severe issues to 
overcome are visualized in Figure 1.1. Stable recognition of a person’s semantic 
attributes is only possible if clean cutouts are available. But sometimes person 
detectors provide bad detections which show a lot of background clutter or 
only parts of a human body. Moreover, the view angle is a factor that greatly 
influences the appearance of a person. Attributes such as backpack may 
not be visible from every point of view. Similar issues arise from occlusions, 
which make it difficult or impossible to determine certain attributes. Lastly, 




Realistic Predictors for Pedestrian Attribute Recognition 


attributes, such as handbag in Figure 1.1(c), can differ greatly regarding their 
appearance. Handbags come in different sizes and colors, making the recognition 
task harder. 


All those challenges indicate that meaningful attribute predictions cannot be 
given in all cases. If, for instance, the lower body of a person is occluded by a 
vehicle, no well-founded statement about the length of the lower-body clothing 
can be made. Although this is a very important topic, it is not addressed in the 
existing literature on pedestrian attribute recognition. However, with regard to 
typical one-hot classification, Wang et al. [17] present an approach which takes 
into account the hardness of the input images and only provides classification 
results if a reliable estimation is possible. Since attribute recognition, albeit 
multi-class, is a classification problem as well, the core idea of this work is to 
transfer and adapt the concept of realistic predictors to this task. 


2 Related work 


Generally, pedestrian attribute recognition approaches from related literature 
can be roughly divided into three different categories: global, part-based and 
attention-based methods. 


Global Models Especially early deep learning-based works on pedestrian 
attribute recognition predict semantic attributes on solely a whole body image of 
a person. In for instance, a multi-branch architecture is applied that contains 
a separate classification layer and loss for each attribute. In contrast, some 
works showed that it is advantageous not to learn all the attributes separately 
but instead learn them all together [7] or partitioned in groups of corresponding 
attributes [i]. In addition to that, the authors in propose to weight the 
attributes during loss calculation according to their frequency of occurrence 
in the dataset to handle the large imbalances of attribute values. The results 
of newer works [15], however, indicate that with the development of larger 
CNN models the joint learning of attributes is not always beneficial and higher 
accuracies can be achieved if separate networks are used for different attributes. 
In general, global models are simple and therefore very efficient compared to 
more complex architectures. This results in faster training and testing, though 
only coarse information is used. Differences between global attributes, such as 
gender, and small-scale attributes, such as shoes or glasses, are not taken into 
account, which aggravates the recognition task. 


Attention-based Models Attention-based methods aim to guide the network to 
focus on the most important regions of activation maps or features. 
propose networks that are capable of implicitly learning visual attention maps. 
A special feature of is the use of a multi-directional attention mechanism 
which means that attention is shared between different semantic layers of the 
network. Moreover, Sarfraz et al. introduce an approach to learn view- 
sensitive embeddings since the viewpoint of a person is really important with 
respect to the appearance of attributes. To improve attention maps explicitly, 
in attention maps are refined using an exponential loss function. Although 
some attention-based methods are proposed in the literature, the gain in accuracy 
is still limited compared to other research fields such as person 
re-identification. 


Part-based Models Part-based algorithms jointly leverage local and global 
information to improve recognition accuracy. This is done by localizing 
body parts of persons using either an external [4][9] or an internal module. In 
patches obtained from a part detector are fed into a fine-grained classification 
model. Similar to that, proposes to use the detector features of the whole 
person and detected parts as input patches for attribute classification layers. A 
slightly different approach is followed in [9]. Instead of bounding boxes estimated 
by a body part detector, pose key points are exploited to localize meaningful body 
part regions. In contrast to these approaches, introduces a method by which 
part localization and attribute classification are jointly learned in an end-to-end 
manner. In BI, the authors use mid-level image patches as representations 
of human body parts. Moreover, LGNet is presented in (ii). Consisting of 
a global and a local network branch, part detection is performed by creating 
so-called EdgeBoxes that are applied in a Region-of-Interest pooling module. 
Such part-based models are less efficient than simple global models but 
are able to focus on fine-grained information, which is very important for the 
recognition of very local attributes such as glasses or shoes. However, it 
is important that body parts can be accurately detected because otherwise the 
approaches suffer from focusing on irrelevant regions of the input image. 


Although part-based models implicitly handle the visibility of body parts or 
attributes, none of the approaches in the literature deals with the fact that in an 
uncooperative real-world scenario attributes cannot be predicted for imperfect 
person image crops or occluded body parts. Therefore, this work aims to close 
this research gap by investigating the concept of realistic predictors, which is 
detailed in the following. 


3 Methods 


In this chapter, the baseline classification model is presented followed by a 
detailed description of the realistic predictor approach. 


3.1 Baseline model 


The baseline model is based on the typical classification pipeline for global 
pedestrian attribute recognition. Images are pre-processed and data 
augmentation is performed. Afterwards, images are fed into a backbone network 
with an appended fully-connected classification layer and output probabilities 
are computed using the sigmoid function. In this case, the task is considered a 
multi-class classification task, which means that all attributes are simultaneously 
predicted using a single classification layer. The sigmoid cross-entropy loss 
function (SCEL) is applied to train the CNN model. To handle the imbalanced 
distribution of positive attribute labels in the dataset, a weighting factor is added 
to the loss 
computation as proposed in [7]. Let $y_i^c \in \{0, 1\}$ be the target label of the 
c-th attribute of the i-th sample and $p^c$ the positive ratio of this attribute in the 
dataset. Then the weighting factor $w_i^c$ can be computed independently for each 
attribute and input image as follows: 


$$w_i^c = \begin{cases} \exp\left(\dfrac{1 - p^c}{\sigma^2}\right) & \text{if } y_i^c = 1 \\ \exp\left(\dfrac{p^c}{\sigma^2}\right) & \text{if } y_i^c = 0 \end{cases} \qquad (3.1)$$


Figure 3.1: The general idea of realistic predictors. On the left, the architecture is shown consisting 
of two branches: a classification and a hardness prediction one respectively. The figure on the right 
depicts the testing stage. Samples with a hardness score above a threshold T are discarded and not 
fed into the classifier. 


$\sigma$ stands for a hyperparameter which is set to 1 in all experiments. This 
weighting factor ensures that the network focuses on rare attributes by increasing 
the weight of such samples. 
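
The effect of the weighting factor described above can be illustrated with a minimal Python sketch. It assumes the DeepMAR form of the weight with the hyperparameter set to 1, binary labels and sigmoid output probabilities; the names `deepmar_weight` and `weighted_scel` are hypothetical and are not taken from the original implementation.

```python
import math

def deepmar_weight(y, pos_ratio, sigma=1.0):
    """Weight for one attribute label y in {0, 1}, given the positive ratio
    of that attribute in the dataset (assumed DeepMAR-style weighting)."""
    if y == 1:
        return math.exp((1.0 - pos_ratio) / sigma ** 2)
    return math.exp(pos_ratio / sigma ** 2)

def weighted_scel(labels, probs, pos_ratios, sigma=1.0, eps=1e-12):
    """Weighted sigmoid cross-entropy of one sample, summed over attributes."""
    loss = 0.0
    for y, p, r in zip(labels, probs, pos_ratios):
        w = deepmar_weight(y, r, sigma)
        loss -= w * (y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps))
    return loss

# A rare attribute (1 % positives): positive labels get the larger weight.
rare_pos = deepmar_weight(1, 0.01)
rare_neg = deepmar_weight(0, 0.01)
```

For an attribute with a positive ratio of 1 %, the positive label is weighted by exp(0.99) and the negative label by exp(0.01), so errors on the rare positive samples contribute more to the loss.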


3.2 Realistic predictors 


The concept of realistic predictors is adapted from [17]. The general approach 
is visualized in Figure 3.1. A network with two branches was designed to 
simultaneously train a classifier and a so-called hardness predictor. The classifier 
outputs probabilities $p_i$ for each class whereas the hardness prediction network 
computes hardness scores. Hardness scores $s_i$ are understood as predictions 
of the difficulty of the classification task for a specific input image. So, for 
instance, the hardness score should be higher if an object is only partially visible 
in comparison with a clean cut of the object of interest. The testing protocol 
is visualized in Figure 3.1 on the right. First, the hardness for all samples 
is predicted. To find those images for which no reliable classification can be 
provided, hard samples are discarded based on a threshold T. The remaining 
samples are then forwarded through the classifier and a class prediction is 
produced. In practice, only attributes for which the classifier is certain would be 
output and then used for person retrieval. 


Two different losses are used to train the two network branches. For training the 
classifier, the use of a weighted softmax cross-entropy loss function is proposed. 
This loss function $L_m$ is shown in the following equation, where $N$ stands for 
the number of samples in the batch and $p_i^c$ denotes the predicted probability for 
target class $c$ and sample $i$. 


$$L_m = -\sum_{i=1}^{N} s_i \log p_i^c \qquad (3.2)$$


As mentioned earlier, the original paper deals with a one-hot classification 
problem in contrast to the pedestrian attribute task. Persons have several 
attributes at the same time, like a woman wearing a red shirt and blue jeans, and 
thus multiple classes can be true. Therefore the loss function for the multi-class 
task is adapted as follows: 


$$L_m = -\sum_{i=1}^{N} \sum_{c=1}^{C} \left[ y_i^c \log p_i^c + (1 - y_i^c) \log(1 - p_i^c) \right] \qquad (3.3)$$


In addition to the sum over all samples, the sum of cross-entropy losses for all 
attributes is computed. $C$ denotes the number of different semantic attributes in 
this case and $y_i^c \in \{0, 1\}$ is the target label of the c-th attribute. 


Another alteration is that, in contrast to the original paper, the feedback of the 
predicted hardness score $s_i$ is omitted. Whereas the authors propose 
this term to focus on those samples that are particularly hard during training, this 
is not necessarily beneficial for attribute recognition. In the object classification 
approach one can be certain that the object is actually present and visible in the 
input image. In contrast, small-scale attributes in particular are often occluded 
and therefore not visible, which could lead to a decrease in recognition accuracy 
if such samples are preferred during the training process. The network would 
not be able to base its decision on meaningful clues and to learn important 
information. 


For training the hardness predictor, another loss function is proposed in [17]: 


$$L_a = -\sum_{i=1}^{N} \left[ p_i^c \log(1 - s_i) + (1 - p_i^c) \log s_i \right] \qquad (3.4)$$


The goal of this function is to produce large hardness scores if and only if the 
cross-entropy loss of the classification branch is high and vice versa. Therefore, 
a kind of inverse cross-entropy loss is used. The loss function becomes minimal if 
$s_i = 1 - p_i^c$ holds. In words, the hardness score is forced to be equal to the 
classification error measured by the prediction probability. Moreover, the more 
the estimated class probability differs from the target value, the higher the loss 
of the hardness predictor. 
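
The statement that the hardness loss of Eq. (3.4) is minimal when the hardness score equals one minus the probability of the target class can be checked numerically. The following Python sketch is only an illustration under that reading of the equation; `hardness_loss` and `best_score` are hypothetical helper names, and a simple grid search stands in for gradient-based training.

```python
import math

def hardness_loss(p_true, s):
    """Per-sample hardness predictor loss as read from Eq. (3.4):
    p_true is the classifier probability of the target class,
    s is the predicted hardness score in (0, 1)."""
    return -(p_true * math.log(1.0 - s) + (1.0 - p_true) * math.log(s))

def best_score(p_true, steps=999):
    """Grid search over s in (0, 1) for the score with the lowest loss."""
    grid = [(k + 1) / (steps + 1) for k in range(steps)]
    return min(grid, key=lambda s: hardness_loss(p_true, s))

# A confidently classified sample (p = 0.8) should receive a low hardness score.
s_star = best_score(0.8)
```

With a target-class probability of 0.8, the grid search returns a score of 0.2, i.e. the classification error of that sample.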


Analogous to the classification loss function, the hardness predictor loss 
calculation also has to be modified to match the requirements of the multi-class 
attribute classification problem. Again, the loss function is expanded to consider 
each attribute. Since, in contrast to the one-hot classification problem, not only 
one positive class is relevant but the presence as well as the absence of all 
attributes, the loss calculation is also based on the target label, as can be seen in 
the following equation: 


$$L_a = -\sum_{i=1}^{N} \sum_{c=1}^{C} \left[ \Delta p_i^c \log(1 - s_i^c) + (1 - \Delta p_i^c) \log s_i^c \right], \qquad (3.5)$$


$$\text{with} \quad \Delta p_i^c = |y_i^c - p_i^c| \qquad (3.6)$$


Thereby, the hardness predictor learns to estimate the difficulty of an image 
regardless of an attribute being present or not in the training image. This is 
ensured by applying the absolute value of the difference between the target class 
label $y_i^c$ and the predicted probability of the presence of an attribute $p_i^c$ 
instead of using $p_i^c$ directly. 


Since the training of the hardness predictor network also suffers from data 
imbalances, DeepMAR weighting can be applied here as well, thus reducing the 
influence of unbalanced attribute distributions on the training. 


3.3 Determination of thresholds 


To improve the accuracy of pedestrian attribute recognition, meaningful 
thresholds for hardness scores need to be determined. It is important to find a good 
trade-off between improving accuracy and rejecting as few samples as possible. 
Thus, multiple strategies to find meaningful thresholds are proposed and 
compared in the evaluation chapter. The thresholds are computed for each 
attribute independently, making use of the evaluation data. To avoid discarding 
too many samples of an attribute, optimization is stopped as soon as 
the threshold is below that of the quantile rejection method. 


Threshold rejection As a baseline for comparison of the other rejection 
approaches, one single threshold which is applied to all attributes is deter- 
mined. 


Quantile rejection In contrast, the quantile rejection method sets the thresholds 
to a value such that a predefined portion of validation samples is discarded. Since 
the distribution of the hardness scores may vary between validation and testing 
data, the proportion of rejected samples can differ during the testing stage. 


Mean accuracy / F1 rejection This rejection approach aims to optimize the 
target evaluation metric, either mean accuracy or F1 score. The threshold value 
is lowered until the target metric no longer increases or until the stop criterion 
mentioned above is reached. 
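
As an illustration of the quantile rejection strategy, the following Python sketch picks a threshold from validation hardness scores of a single attribute so that a predefined fraction of samples is rejected. It is a simplified sketch under stated assumptions: the function names, the rounding rule and the example scores are hypothetical, and in the setting described above the procedure would be repeated for each attribute.

```python
def quantile_threshold(hardness_scores, reject_fraction):
    """Return a threshold such that roughly `reject_fraction` of the
    validation scores lie strictly above it."""
    ordered = sorted(hardness_scores)
    keep = int(round(len(ordered) * (1.0 - reject_fraction)))
    keep = max(1, min(keep, len(ordered)))  # always keep at least one sample
    return ordered[keep - 1]

def reject(hardness_scores, threshold):
    """Indices of samples whose hardness score exceeds the threshold."""
    return [i for i, s in enumerate(hardness_scores) if s > threshold]

# Hypothetical validation scores for one attribute; 20 % should be rejected.
val_scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90]
t = quantile_threshold(val_scores, 0.2)
rejected = reject(val_scores, t)
```

Applied to test data, the same threshold may reject a different share of samples, because the hardness score distribution can differ from that of the validation data.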


4 Evaluation 


The previously introduced approaches are evaluated and discussed in the 
following. After some details about the datasets used and the experimental 
setup, the results of the experiments are presented. 


4.1 Datasets 


The experiments are conducted on two different publicly available datasets. 
Both datasets contain person bounding boxes that are all taken from videos of 
surveillance cameras. A brief introduction to the RAP-2.0 and PA-100K datasets is 
given in the following. Some sample images of both datasets can be found in 
Figure 4.1. 


Figure 4.1: Randomly selected images from the datasets are shown for comparison. Figures (a) - 
(e) are taken from the RAP-2.0 dataset whereas Figures (f) - (j) are from the PA-100K dataset. 


The RAP-2.0 dataset consists of 84,928 images taken from 26 different 
cameras. All cameras were mounted indoors and show scenes of a shopping mall. 
72 different binary attributes ranging from gender to attachments are annotated. 
Since the distributions of the attribute annotations are highly unbalanced, only 
those attributes with a positive ratio greater than 1 % are used in the experiments. 
After discarding very rare attributes, 54 attributes remain, whose positive ratios 
are shown in Figure 4.2. 


Unlike the RAP-2.0 dataset, the PA-100K dataset contains images recorded 
in an outdoor setting. As the dataset name suggests, 100,000 images from 598 
different cameras are included and 26 binary attributes are provided. Moreover, 
the distributions of attribute annotations are more balanced. 


4.2 Experimental setup 


Data pre-processing and augmentation During the training phase, images are 
resized and randomly cropped to match the input size of the CNN. In addition, 
random flipping is applied to increase the diversity of training data. 


Backbone model Experiments with different backbone models were carried 
out. Since the observations presented in this chapter are valid regardless of the 
CNN model used, only results for ResNet-50 [6] are presented and discussed. 


Figure 4.2: Positive ratios of RAP-2.0 attributes. Only few attributes have balanced distributions 
while most attributes such as attachment-backpack occur very rarely. 


Parameters To train the models, a multi-step scheduling scheme was applied in 
all experiments. Two steps are performed with a decay factor of 0.1. RAP-2.0 
models were trained for a total of 180 epochs with steps after 60 and 120 
epochs. The learning rate for the Adam optimizer was initially set to $10^{-4}$ for 
the classifier and 10° for the hardness predictor, respectively. For training the 
networks with the PA-100K dataset, parameters were set to the values suggested 
in [2]. 
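
The multi-step schedule for the RAP-2.0 classifier can be written down as a small Python sketch. It assumes zero-based epoch counting and an initial learning rate of 1e-4 for the classifier; the function name `multistep_lr` is hypothetical and does not refer to a specific framework API.

```python
def multistep_lr(base_lr, epoch, milestones=(60, 120), gamma=0.1):
    """Learning rate at a given (zero-based) epoch: the base rate is
    multiplied by `gamma` once for every milestone already reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed

# Classifier learning rate at the start, after the first and after the second step.
lr_start = multistep_lr(1e-4, 0)
lr_mid = multistep_lr(1e-4, 60)
lr_end = multistep_lr(1e-4, 179)
```

Under these assumptions the classifier is trained with a rate of 1e-4 for the first 60 epochs, 1e-5 for the next 60 and 1e-6 for the final 60 epochs.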


4.3 Hardness prediction 


Table 4.1 presents the attribute recognition results of the classifiers. Using 
positive ratio-based DeepMAR weighting of the loss during training significantly 
increases the recognition performance by reducing the influence of imbalanced 
attribute distributions. Moreover, the results clearly indicate that using feedback 
of the HP-Net for training the classifier network is not beneficial for pedestrian 
attribute recognition. In the original approach this feedback was proposed to 
force the classifier to focus on those samples which are hard to classify. But in 
contrast to typical image classification, attributes are small-scale features and 
thus not necessarily visible in hard-to-classify images. As a result, focusing 
on such hard samples confuses the CNN and accuracy decreases regarding all 
metrics, as can be seen from the experimental results in the table. 


Table 4.1: Quantitative evaluation of baseline methods on RAP-2.0 dataset. DeepMAR weighting 
of training samples greatly improves mA. Training the classifier with HP-Net feedback deteriorates 
the results in all metrics. 


Model                        mA     Acc    Prec   Rec    F1 
SCEL                         64.29  62.26  82.55  70.09  75.81 
DeepMAR                      73.05  63.99  77.01  77.17  77.09 
SCEL + HP-Net feedback       61.93  52.00  69.33  66.81  68.05 
DeepMAR + HP-Net feedback    67.32  61.18  76.74  73.49  75.08 


Next, it is important to evaluate the quality of the given hardness predictor. 
For this purpose, Figures 4.3 and 4.4 show person images assessed as easy 
as well as hard. Figure 4.3 visualizes samples for the gender 
attribute. The qualitative results seem reasonable. It is easy for the classifier to 
classify a person as a woman if the person is wearing a skirt or has long hair 
that is clearly visible. In contrast, hard samples are images showing only partial 
persons such as the first image in Figure 4.3(b). Even a human cannot make a 
reliable statement about the sex, because only the legs of the person are visible. 
Moreover, images in which the length of the hair is not clearly visible are hard 
to assess for the classifier and therefore more prone to misclassification. 


(b) Hard samples 


Figure 4.3: Hard and easy samples for the attribute Gender of the RAP-2.0 dataset based on the 
estimated hardness scores. Samples that are considered easy or hard appear to be reasonable for 
this attribute. 


These observations are valid for many of the attributes but there are attributes, 
like backpack, for which different results are obtained. As an example, easy 
and hard samples for the attribute Backpack are shown in Figure 4.4. All easy 
samples show persons without a backpack whereas each of the persons from the 
particularly hard samples wears a backpack. So, in this case it seems that the 
decision between hard and easy images is only taken based on the presence of 
the attribute and thereby mirrors the classifier instead of providing independent 
hardness predictions. This indicates that, although the hardness predictor loss 
is weighted by the positive ratio of attributes, the imbalance of attributes in 
the training data still plays a big role and influences the recognition accuracy 
negatively. Since only about 1 % of the training images show persons with 
backpacks, the network can achieve good results by only predicting no backpack. 
Thus, the loss becomes minimal for such images and high for images with 
backpacks. As a result, the hardness predictor learns to discriminate between the 
values of the attribute and not to predict the hardness of the attribute recognition 
task. 


4.4 Realistic prediction 


(b) Hard samples 


Figure 4.4: Hard and easy samples for the attribute Backpack of the RAP-2.0 dataset based on 
the estimated hardness scores. In contrast to Gender, persons with backpacks are considered 
hard-to-classify due to the high attribute imbalance. 


Based on the finding that the hardness predictor can give meaningful estimates 
of the degree of difficulty of samples, the realistic predictor can be formed by 
combining the classifier with a hardness-based rejection. Table 4.2 presents 
the results for different rejection strategies and compares them to confidence 
score-based rejection. Improvements in instance-based metrics can be observed, 
independent of the applied rejection method. The mA score decreases except 
for the mA rejection. This is due to the side effects of unbalanced attributes, 
which are always predicted as false and thus reach only a minimum mA score 
of 0.5. When comparing rejection methods, the threshold strategy achieves the 
best F1 scores whereas, as mentioned above, mA rejection leads to the highest mA 
results. Although hardness prediction-based rejection of attributes increases the 
performance, rejection on the basis of class probabilities achieves similar or 
even better performance, especially on the RAP-2.0 dataset. This finding indicates 
that the major issue with the external hardness prediction network is still the 
unbalanced distribution of attribute values and that the DeepMAR weighted loss 
function is not completely capable of compensating for it. 
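
The remark that an attribute which is always predicted as false reaches only an mA score of 0.5 can be reproduced with a short Python sketch. It assumes the label-based mean accuracy commonly used in pedestrian attribute recognition, i.e. the average of true positive rate and true negative rate per attribute, averaged over all attributes; the function name and the toy data are hypothetical, and every attribute is assumed to have at least one positive and one negative sample.

```python
def mean_accuracy(labels, preds):
    """Label-based mean accuracy (mA) over all attributes.
    labels and preds are lists of per-sample lists with entries in {0, 1}."""
    n_attr = len(labels[0])
    total = 0.0
    for c in range(n_attr):
        pos = [p[c] for l, p in zip(labels, preds) if l[c] == 1]
        neg = [p[c] for l, p in zip(labels, preds) if l[c] == 0]
        tpr = sum(pos) / len(pos)                  # recall on positive labels
        tnr = sum(1 - v for v in neg) / len(neg)   # recall on negative labels
        total += 0.5 * (tpr + tnr)
    return total / n_attr

# One rare attribute: 1 positive among 100 samples, classifier always says "no".
labels = [[1]] + [[0]] * 99
preds = [[0]] * 100
ma = mean_accuracy(labels, preds)
plain_acc = sum(int(l == p) for l, p in zip(labels, preds)) / len(labels)
```

In this toy case the plain accuracy is 0.99 while mA is 0.5, which shows why mA is sensitive to rare attributes that are never predicted as present.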


Table 4.2: Realistic predictor results on RAP-2.0 dataset. Rejection strategies mainly improve 
instance metrics. Hardness scores provided by an explicit hardness predictor do not surpass the 
baseline given by using confidence scores of the classifier. 


                         RAP-2.0                   PA-100K 
Rejection strategy       mA     F1     Rejected    mA     F1     Rejected 
None                     72.98  77.12   0.00       75.23  83.33   0.00 
Hardness scores: 
  Threshold              69.18  83.59  12.54       74.34  90.53  15.04 
  Quantile               66.07  81.80  24.58       74.20  88.08  22.54 
  mA                     74.02  78.93   7.75       78.09  90.32  15.18 
  F1                     66.14  79.52  16.68       74.78  87.67  13.44 
Confidence scores: 
  Threshold              71.77  85.98  14.51       75.87  91.20  17.32 
  mA                     74.79  82.88  12.13       77.88  91.00  17.44 


In conclusion, it can be stated that the realistic predictor approach using an 
external hardness predictor generally works. But the assumption that such an 
additional CNN is superior to the use of confidence scores cannot be fully 
validated for the pedestrian attribute task. Both networks learn complementary 
tasks and so the rejection rate is much lower when the hardness predictor network 
is used. However, the results of the confidence scores are not exceeded. 


5 Conclusion and future work 


This work aimed to apply the concept of realistic predictors to the field of 
pedestrian attribute recognition. The core idea was to address some of the biggest 
challenges in pedestrian attribute recognition while simultaneously achieving 
more reliable attribute estimates. To achieve this, the approach introduced in 
[17] was modified and optimized for the task of attribute recognition. This 
included, for instance, adapting the loss functions and alterations to the 
network architecture. In addition, different strategies to determine meaningful 
thresholds for exclusion of unreliable predictions were proposed and extensively 
studied. 


All in all, the findings of this work showed that the concept of realistic predictors 
can be transferred to the field of pedestrian attribute recognition and accuracy 
improvements can be achieved. However, comprehensive experiments indicate 
that the predictions of hardness do not reflect the difficulty of the task equally 
well for all attributes. Especially attributes with strongly unbalanced value 
distributions in the training dataset cause problems and worsen the results. As a 
result, better performance was achieved if confidence scores were used instead 
of hardness predictions. In one point, however, the hardness predictions were 
strongly superior to the confidence values, namely in the number of rejected 
samples. From this it can be concluded that training a separate hardness predictor 
has its advantages. 


In future research, the training of the hardness predictor and the loss function can 
be improved in order to eliminate the imbalance problem of some attributes. The 
aim is to close the performance gap with the confidence-based rejection while 
maintaining the advantage in terms of the number of rejected samples. Moreover, 
the hardness predictor approach allows attributes to be weighted during 
attribute-based person retrieval. By weighting attributes during distance 
computation according to how difficult they are to predict, incorrect retrieval 
results in early ranking positions can be avoided. 


Abstract 


Distributed usage control is a form of usage control that spans over multi- 
ple domains and computer systems. As a result, usage control components 
responsible for evaluating policies, gathering information, executing actions and 
enforcing decisions are operated in the vicinity of different stakeholders with 
conflicting interests. In order to prevent malicious stakeholders from manipulat- 
ing these components, remote attestation can be used to verify the integrity of 
their code base. However, in a distributed case it is not always apparent what 
sequence of attestations is necessary and which verifier should conduct them. 
Furthermore, it is unclear what impact a failed attestation has on the 
trustworthiness of the whole usage control system. To answer these questions, it is 
necessary to identify which agents need to trust each other in order to securely 
execute a certain usage control function. Then the sequence of remote attestations 
that occur across the distributed usage control system can be examined accordingly. 
In this work we develop a formal model that represents the trust relationships of 
distributed usage control systems with multiple collaborating actors. Based on 
the conducted attestations we define simple binary and non-binary trust metrics 
that quantify the trust level a data owner can expect at a certain point in time. 
Finally, we show how the model can be used to determine the level of trust 
reached in a real-world scenario. 


1 Introduction 


In recent years, usage control (UC) has been increasingly promoted as a 
novel technology for governing access to valuable information. Unlike classical 
access control, usage control models focus on managing the future usage of data 
[7]. With usage control technology it is possible to restrict access to protected 
assets even after they have been disclosed. Often usage control is used in 
distributed environments, where sensitive data are shared between stakeholders. 
One such example is the Fraunhofer research project International Data Space 
[6]. The International Data Space allows data providers to distribute valuable 
data alongside usage restrictions to potentially malicious data consumers. The 
data consumer’s systems then process the received information according to 
the published rules. Naturally, the data provider wants to ensure that the data 
consumer can be trusted to obey the issued usage restrictions on the data. For this, 
the International Data Space uses distributed UC modules that independently 
evaluate the usage control policies and enforce the resulting decisions. Since 
each participant of the data space may act maliciously and try to extract foreign 
data past the protection mechanisms, it is necessary to verify the integrity of the 
UC components prior to the data exchange. 


Trusted computing is the state-of-the-art approach that allows for remote
verification of software components. Currently the most widespread trusted
computing technologies are Trusted Platform Modules (TPMs) [9] and Intel’s
Software Guard Extensions (SGX) [3]. Both of these technologies support
establishing trust in remote software stacks by verifying code bases through 
special hardware and cryptographic methods. This software verification process 
is called remote attestation. Besides verifying the integrity of a software stack, 
remote attestation also establishes secure channels between prover and verifier. 
The International Data Space uses TPMs and a customized remote attestation 
protocol to establish trust in data consumers. However, when developing 
distributed usage control systems that establish trust by remote attestation,
several open questions remain. For example, it is not always clear which
components have to be attested, and by which verifiers. Comprehensive usage
control systems are complicated and security-relevant UC functions may span
multiple distributed UC components. This is especially true if the usage
control system also includes components that track and store the provenance 
of supervised data. In these cases it has to be ensured that all involved UC 
components are properly attested and can securely communicate with each 
other. Another interesting question is what impact a failed attestation has on 
the security of the overall system. These questions all emerge from the yet 
unsolved problem of quantifying the trust propagation in dynamically operating 
and distributed usage control systems. 


In this work we develop a formal model that can represent the trust relationships 
that occur in distributed usage control systems with multiple collaborating actors. 
This model is independent of the design of the UC system, its implementation,
and the used trusted computing technology. Furthermore we define simple
binary and non-binary trust metrics that can be used to determine the trust level
of certain UC functionalities at a specific point in time. Calculating a dynamic
trust level for a UC system supports a comprehensive
security analysis of the infrastructure. Finally we show how the model can be
used to determine the level of trust in a real-world example scenario based on 
the International Data Space using TPM-based attestation. 


2 Related work 


Managing and distributing trust has been a major topic of research interest 
for a long time. By far the most widespread technique of managing trust in 
distributed systems is via a public key infrastructure (PKI). With a PKI, a few 
trusted certification authorities (CAs) issue signed public keys for the agents 
in their domain. As a result, the trust in a certain communication channel is 
reduced to the trustworthiness of the CA. Even though PKIs are a fundamentally 
important concept in IT security, as a centralized way of managing trust they 
are not applicable to our scenario. In terms of decentralized approaches to trust
management, the most important concept is the Web of Trust [1], which has
been popularized by the well-known PGP software. Its main principle is to
distribute trust transitively by endorsement of already trustworthy collaborators 
(i.e. “my friend’s friend is my friend”). Also, it is possible to generate new trust 
by offline comparison of public key fingerprints. This decentralized version of 
trust distribution already comes close to what we need for our scenario. A usage 
control component could determine the level of trust in a remote system based 
on the trust that their peers already have in it. New trust would then be generated 
by automated remote attestation instead of manually comparing fingerprints. 
However, the Web of Trust does not offer any kind of trust metric, and does not
take possible internal attackers into account. It also has no notion of
time.


Dynamic reputation systems [4, 10] are an approach that factors in these
aspects. Their idea is to describe trust mathematically and to develop a metric
for the reputation of an agent based on its previous behavior. Simply put, if an
agent behaves cooperatively, its level of trust increases. If the agent defects,
its trust level decreases. However, since it is not well-defined what
constitutes “cooperative behavior” in our scenario, reputation systems also
do not suffice for quantifying trust in distributed UC systems. Furthermore, they
neither define what actions are suitable to increase or decrease trust, nor do they 
deal with attestation mechanisms. Since our goal is to develop a formal model 
of distributed UC systems that works independently of the system design or the 
used attestation technology, reputation systems do not meet our requirements. 


3 Formal model 


Our goal is to develop a metric that quantifies the level of trust in distributed 
UC systems. For this, a formal model is required that describes the trusted 
communication between usage control components. Since trust relationships 
can be intuitively modeled as graphs, we utilize a graph-based approach. 
Furthermore, the formal model needs to represent attestations conducted by the 
UC components as well as the architecture of the deployed UC system. In this 
section, we develop a suitable model in three steps.

1. Define functions of the UC components that have to be trusted using a
graph-based model (global). 


2. Define the existing agents and cross-system activities of interest by 
instantiating that graph (scenario specific). 


3. Define the system architecture by binding the agents to attestable systems 
(implementation specific). 


As a first step, the basic semantics of the usage control system are specified via
a trust dependency graph. The trust dependency graph contains the existing
types of usage control components. It describes how they need to trust each
other for any interaction that may occur between them. In the second step we
concretize this graph by considering the actual components that are operated in
the distributed usage control system. For this we represent each concrete UC
component as an instance of a node from the trust dependency graph. We call a
concrete UC component an agent, because it needs to securely interact with other
components in the system. The resulting graph is called the agent graph. Unlike
the trust dependency graph, each agent graph is specific to a certain scenario 
that the UC system is deployed for. It also yields information about the actors 
that operate the usage control components in that scenario. The agent graph 
can be partitioned into multiple UC activities, which represent a function of the 
distributed UC system spanning over multiple UC components. We will later 
show how the trust level of a UC activity can be measured using an instance 
of the model. Finally, an architecture graph defines how the agents map to 
physical computer systems that can be attested. The architecture graph is not 
only specific for a certain UC scenario, but also depends on the used trusted 
computing technology and the deployment of UC components. Figure 3.1 shows
an overview of the steps required to transfer the design and implementation
of a UC system into the formal model. In the following sections we present
this formal model in detail. Afterwards we develop trust metrics that can be
evaluated on an instance of the model.


Figure 3.1: Overview of the formal model.


3.1 Defining UC systems 


The first step of the formal model is to define a distributed UC (DUC) system and
its components. This is done in Definition 3.1.1.


Definition 3.1.1 (DUC system). Let $M$ be a finite set of DUC modules and $F$
a set of DUC functions. We call the tuple $S := (M, F)$ a DUC system.


Besides the UC components and their functions, we also need to define how
the UC components may interact with each other. This is done by the trust
dependency graph, as described in Definition 3.1.2.


Definition 3.1.2 (Trust Dependency Graph). Let $S = (M, F)$ be a DUC system.
Let $E^F \subseteq M \times M$ be a set of directed edges over $M$ and
$l^F : E^F \to F$ a mapping that labels each edge with a system function. We
call the triple $FG := (M, E^F, l^F)$ the trust dependency graph of $S$.


The trust dependency graph of a DUC system describes the inter-component 
functions that may be called across the distributed system. A trust dependency 
graph can be constructed solely with knowledge of the UC components’ interfaces.
It is not necessary to know the use case or the usage control policies that should 
be deployed. Hence the trust dependency graph is independent of the system’s 
concrete realization and implementation. 
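
As a minimal sketch of Definitions 3.1.1 and 3.1.2, the following Python snippet stores a DUC system together with its labeled trust edges. The class name `TrustDependencyGraph` and the selection of edges are illustrative assumptions and not part of the model; only the `revoke` edge from PMP to PAP follows the example discussed with Figure 3.2.

```python
from dataclasses import dataclass


@dataclass
class TrustDependencyGraph:
    """Trust dependency graph (M, E^F, l^F) of a DUC system S = (M, F)."""
    modules: set      # M: module types
    functions: set    # F: DUC functions
    labels: dict      # l^F as a dict; its keys are the directed edges E^F

    def __post_init__(self):
        # Every edge must connect known modules and carry a known function.
        for (src, dst), function in self.labels.items():
            if src not in self.modules or dst not in self.modules:
                raise ValueError(f"edge ({src}, {dst}) is not in M x M")
            if function not in self.functions:
                raise ValueError(f"{function!r} is not a DUC function in F")

    @property
    def edges(self):
        """E^F: an edge (m1, m2) means that m1 has to trust m2."""
        return set(self.labels)


# Illustrative (assumed) excerpt of a trust dependency graph.
tdg = TrustDependencyGraph(
    modules={"PEP", "PDP", "PXP", "PMP", "PAP"},
    functions={"notify", "execute", "revoke"},
    labels={
        ("PEP", "PDP"): "notify",
        ("PDP", "PXP"): "execute",
        ("PMP", "PAP"): "revoke",  # the PMP has to trust the PAP
    },
)
```

Storing the labeling as a dictionary keeps the edge set and the labeling consistent by construction, because every edge has exactly one function.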


An example of a trust dependency graph is presented in Figure 3.2. It shows
the trust dependency graph for the XACML-based distributed usage control 
architecture that is deployed in the International Data Space. XACML [2] is 
a reference architecture that defines usage control components responsible for 
enforcement (PEP), policy evaluation (PDP), information gathering (PIP), and 
administration (PAP). Besides these XACML-based components, the usage 
control architecture of the International Data Space uses some additional 
components responsible for retrieving policies (PRP), managing communication 
(PMP) and executing obligations (PXP). The displayed DUC system is modeled as
$M = \{PEP, PDP, PIP, \dots\}$ and $F = \{notify, evaluate, execute, \dots\}$. The
trust dependency graph shows the possible interactions and the resulting trust 
dependencies between components as labeled edges. Note that the direction of
the edge defines the direction of the trust dependency, which does not always
correspond to the direction of the interaction. For example, a PAP may revoke
a policy by calling the revoke function of the responsible PMP. However, the
edge is directed the opposite way, because in this case the PMP has to trust the
PAP that the revocation request is legitimate.


Figure 3.2: Example of a trust dependency graph.


3.2 Defining agents and activities 


In order to represent a specific scenario, we can instantiate the trust dependency
graph and introduce agents that interact with each other. This is done in
Definition 3.2.1.


Definition 3.2.1 (Agent Graph). Let $FG = (M, E^F, l^F)$ be a trust dependency
graph. Let $A$ be a set of agents, $E \subseteq A \times A$ a set of directed
edges over $A$, and $l : E \to F$ a mapping that labels each edge. Let further
$\mathrm{type} : A \to M$ be a mapping that assigns a module type to each agent.
We call the tuple $G := (A, E, l, \mathrm{type})$ an agent graph if it holds that

$$\forall (a, b) \in E : (\mathrm{type}(a), \mathrm{type}(b)) \in E^F$$
$$\forall (a, b) \in E : l(a, b) = l^F(\mathrm{type}(a), \mathrm{type}(b))$$


According to Definition 3.2.1, every agent is an instance of a UC component. The
agent interaction corresponds to the DUC functions that have been described by
the trust dependency graph. The two conditions in Definition 3.2.1 ensure that the agent
graph only contains edges that correspond to the trust dependency graph (i.e. 
agents can only call existing functions). Note that the agent graph may contain 
multiple agents of one particular type (e.g. if multiple PIPs or PEPs exist), while 
the trust dependency graph contains each component exactly once. 
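
The two conditions of Definition 3.2.1 can be checked mechanically. The following Python sketch does this for dictionaries that stand in for $l$, $\mathrm{type}$ and $l^F$; the function name `is_agent_graph`, the agent names and the excerpt of the trust dependency graph are assumptions made for illustration.

```python
def is_agent_graph(edges, labels, agent_type, tdg_labels):
    """Check the two conditions of Definition 3.2.1.

    edges      -- set of agent pairs (a, b), i.e. E
    labels     -- dict (a, b) -> function, i.e. l
    agent_type -- dict agent -> module type, i.e. type
    tdg_labels -- dict (m1, m2) -> function, i.e. l^F with key set E^F
    """
    for a, b in edges:
        type_edge = (agent_type[a], agent_type[b])
        if type_edge not in tdg_labels:                # condition 1
            return False
        if labels[(a, b)] != tdg_labels[type_edge]:    # condition 2
            return False
    return True


# Assumed excerpt of a trust dependency graph and agents of two actors A and B.
tdg_labels = {("PEP", "PDP"): "notify", ("PDP", "PXP"): "execute"}
agent_type = {"pep_a": "PEP", "pdp_a": "PDP", "pxp_a": "PXP", "pdp_b": "PDP"}

valid_edges = {("pep_a", "pdp_a"), ("pdp_a", "pxp_a")}
valid_labels = {("pep_a", "pdp_a"): "notify", ("pdp_a", "pxp_a"): "execute"}

# There is no PDP -> PDP edge in this excerpt, so the edge below is rejected.
invalid_edges = {("pdp_a", "pdp_b")}
invalid_labels = {("pdp_a", "pdp_b"): "notify"}
```

Here `is_agent_graph(valid_edges, valid_labels, agent_type, tdg_labels)` holds, while the second edge set violates the first condition.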


The agent graph shown in Figure 3.3 is based on the example trust dependency
graph in Figure 3.2. The example agent graph shows a scenario with two actors
A and B, who operate distributed usage control components. In this scenario, 
the PXP instance of actor B is responsible for deploying policies at the PDP 
instance of actor A. This allows B to enforce usage control policies on his data, 
even if they are shared with A. Note that the agent graph contains multiple 
instances of a single UC component. For example, in this case both actors A 
and B operate PDPs, PEPs and PXPs. While the trust dependency graph is 
of a global nature and represents an abstract DUC architecture, agent graphs
derived from it depend on specific use cases. Also note that all agent graph
edges correspond to edges from the trust dependency graph, but not all trust
relationships may be included in the agent graph, depending on their relevance
for the scenario.


Figure 3.3: Example of an agent graph.


Besides defining the agents of the UC system, we also have to specify what kind 
of agent interaction should be evaluated for trustworthiness. Definition 3.2.2
partitions the agent graph into multiple acyclic subgraphs called UC activities. 
A UC activity represents an action that requires multiple agents to work together, 
such as the deployment of policies or the enforcement of access decisions. Since 
the involved agents have to trust each other in order to reliably execute these 
actions, the trust level of a UC system will be based on the relevant UC activities. 


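Definition 3.2.2 (UC Activity). Let $G = (A, E, l, \mathrm{type})$ be an agent
graph. Let $H := (\tilde{A}, \tilde{E}, \tilde{l})$ be a connected subgraph of
$G$ with $\tilde{A} \subseteq A$, $\tilde{E} \subseteq E \cap (\tilde{A} \times
\tilde{A})$ and $\tilde{l} := l|_{\tilde{E}}$. We call the subgraph $H$ a UC
activity of $G$ if

$$H \text{ is acyclic}$$
$$\exists!\, x \in \tilde{A} : \mathrm{indeg}(x) = 0$$
$$\exists\, y \in \tilde{A} : \mathrm{outdeg}(y) = 0$$

The unique vertex $x$ is called the root of $H$. A vertex $y$ with
$\mathrm{outdeg}(y) = 0$ is called a leaf of $H$. The set of all leaves is
denoted by $Y$.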
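
The conditions of Definition 3.2.2 can be tested on a candidate subgraph as sketched below in Python. The function name `is_uc_activity` is an assumption, connectivity is interpreted as weak connectivity, and the example edge set is an assumed reading of the local policy enforcement activity rather than a transcription of a figure.

```python
from collections import deque


def is_uc_activity(vertices, edges):
    """Check: connected, acyclic, exactly one root, at least one leaf."""
    indeg = {v: 0 for v in vertices}
    successors = {v: [] for v in vertices}
    undirected = {v: set() for v in vertices}
    for a, b in edges:
        indeg[b] += 1
        successors[a].append(b)
        undirected[a].add(b)
        undirected[b].add(a)

    roots = [v for v in indeg if indeg[v] == 0]
    leaves = [v for v in indeg if not successors[v]]
    if len(roots) != 1 or not leaves:
        return False

    # Weak connectivity: every vertex is reachable when edge directions are ignored.
    seen = {roots[0]}
    queue = deque(seen)
    while queue:
        v = queue.popleft()
        for w in undirected[v] - seen:
            seen.add(w)
            queue.append(w)
    if seen != set(indeg):
        return False

    # Kahn's algorithm: the graph is acyclic iff every vertex can be removed.
    remaining = dict(indeg)
    queue = deque(roots)
    removed = 0
    while queue:
        v = queue.popleft()
        removed += 1
        for w in successors[v]:
            remaining[w] -= 1
            if remaining[w] == 0:
                queue.append(w)
    return removed == len(indeg)


# Assumed edges of the local policy enforcement activity (root: pep).
activity_vertices = {"pep", "pdp", "pip", "pxp"}
activity_edges = {("pep", "pdp"), ("pdp", "pip"), ("pdp", "pxp"), ("pep", "pip")}
```

With these inputs `is_uc_activity(activity_vertices, activity_edges)` holds, with `pep` as root and `pip` and `pxp` as leaves.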
Figure 3.4 shows an example of a UC activity based on the agent graph in Figure
3.3. The depicted UC activity represents the necessary interaction for locally
enforcing a policy. First, the enforcement point (PEP) notifies the decision 
point (PDP) of an access request. The PDP then evaluates the policies, requests 
necessary information at the PIP and executes obligations at the PXP. In this 
activity the PEP acts as root, while the PXP and the PIP are leaves. In order 
to trust the UC activity of local policy enforcement, all of these interactions 
need to be secure. Complex distributed usage control systems, such as the 
International Data Space, have many more relevant UC activities that can be 
identified, including remote policy enforcement, policy deployment, and policy 
revocation. However, for the remainder of this paper we will stick to the example 
of local policy enforcement. 


Figure 3.4: Example of a UC activity: Local policy enforcement.


3.3 Defining attestations and architectures 


Finally, the formal model needs to contain information about the remote
attestations that can be executed by the agents. To accommodate this, Definition
3.3.1 introduces the notion of attestation containers. An attestation container
is a set of agents that can be jointly attested. Which agents form an attestation 
container depends on the used attestation technology and the system architecture. 
For example, if the UC system uses TPMs to execute the remote attestations, all 
UC components running on a TPM-protected computer system are included in 
an attestation container. More advanced trusted computing technologies, such 
as Intel’s SGX, allow the attestation of software enclaves rather than whole 
computer systems. In that case, all UC components included inside such an
enclave form an attestation container. The tuple of agent graph and attestation 
containers is called architecture graph. 


Definition 3.3.1 (Attestation Container). Let $G = (A, E, l, \mathrm{type})$ be
an agent graph. We call a non-empty set $C \subseteq A$ an attestation container
if all $c \in C$ can be jointly attested. The set of all attestation containers
is denoted by $\mathcal{C} \subseteq \mathcal{P}(A) \setminus \{\emptyset\}$.
The tuple $(G, \mathcal{C})$ is called the architecture graph.


Based on the description of the attestation containers, we have to represent the
concrete attestations that agents actually conduct at runtime. This is done
in Definition 3.3.2 via an attestation schedule.


Definition 3.3.2 (Attestation Schedule). Let $(G, \mathcal{C})$ be an
architecture graph. For any agent $a \in A$ we call the mapping
$att_a : \mathbb{N}^+ \times \mathcal{C} \to \{-1, 0, 1\}$ the attestation
schedule of $a$. The family of all attestation schedules is denoted by
$\mathcal{A} = (att_a)_{a \in A}$.


The attestation schedule of an agent indicates which attestations the agent
conducts at what points in time, and whether they are successful. More concretely,
if $att_a(t, C) = 1$, then at time $t$ the agent $a$ conducts a successful remote
attestation of container $C$. This means that $a$ successfully verifies the integrity
of all agents that are included in $C$. If instead $att_a(t, C) = -1$, the attestation
fails and the agent is unable to verify the integrity of $C$. If $att_a(t, C) = 0$, the
agent $a$ does not conduct a remote attestation of container $C$ at time $t$.
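
Since an agent conducts no attestation at most points in time, an attestation schedule can be stored sparsely. The Python sketch below is one possible encoding and not prescribed by the model: the helper `make_schedule` and the example events are assumptions, and every pair $(t, C)$ that is not listed evaluates to 0.

```python
def make_schedule(events):
    """Build att_a from a sparse dict {(t, container): 1 or -1}.

    Time steps and containers that are not listed map to 0 (no attestation).
    """
    normalized = {}
    for (t, container), result in events.items():
        if result not in (1, -1):
            raise ValueError("only successful (1) or failed (-1) attestations are stored")
        normalized[(t, frozenset(container))] = result

    def att(t, container):
        return normalized.get((t, frozenset(container)), 0)

    return att


# Assumed example: an agent attests {pdp, pxp} at t = 1 and {pip} at t = 2.
att_pep = make_schedule({(1, ("pdp", "pxp")): 1, (2, ("pip",)): 1})
```

Containers are normalized to `frozenset`, so `att_pep(1, {"pxp", "pdp"})` and `att_pep(1, ("pdp", "pxp"))` refer to the same container.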


4 Quantifying trust


The formal model allows us to mathematically represent a distributed UC system.
Based on an architecture graph and the associated attestation schedules we can
now define trust metrics for the relevant UC activities.


4.1 Binary trust metrics 


Given a UC activity $H$, we denote the level of trust in the activity at time $t$ by
$\mathrm{TrustLevel}^H(t) \in \{0, 1\}$. A trust level of 1 means that the activity is trusted,
while a trust level of 0 indicates that the attestations are not sufficient to ensure
the integrity of all involved components. In order to define this trust level, we
examine the paths of $H$ and calculate trust gains for each transition within
a path.


4.1.1 Trust gain by attestations 


Whenever an agent $a$ conducts a successful attestation of container $C$, possible
transitions between $a$ and another agent $c \in C$ are trusted and positively influence
the trust level of $H$. However, this positive influence only lasts as long as no
agent conducts an unsuccessful attestation of $C$, thereby determining
that its integrity cannot be trusted anymore. This idea is expressed in Definition
4.1.1. Like the overall trust level, the trust gain is binary. A trust gain of 1 for the
transition $(a, b)$ means that $b$ has been attested by $a$, and no agent has failed to
verify the integrity of $b$ since. A trust gain of 0 indicates that $a$ has not yet
attested a container that includes $b$, or that such an attestation is outdated.


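Definition 4.1.1 (Trust Gain by Attestation). Let $(G, \mathcal{C})$ be an
architecture graph and $\mathcal{A} = (att_a)_{a \in A}$ the family of
associated attestation schedules. Let $H$ be a UC activity of $G$ and
$(v_1, \dots, v_n) \in H$ a path of the activity. The trust gain by attestation
for the transition $(v_{i-1}, v_i)$ at time $t$ is defined as

$$
\mathrm{Gain}^{att}(i, t) :=
\begin{cases}
1, & \exists C \in \mathcal{C},\ t_1 \le t :\ v_i \in C \,\wedge\, att_{v_{i-1}}(t_1, C) = 1 \,\wedge\, \forall a \in A : \nexists\, t_2 \in [t_1, t] : att_a(t_2, C) = -1 \\
0, & \text{else}
\end{cases}
$$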
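
To make the quantifiers of Definition 4.1.1 concrete, the following Python sketch evaluates the binary trust gain by attestation. The data layout is an assumption: `schedules` maps each agent to a sparse dictionary `{(t, container): result}` and `containers` is a set of `frozenset` objects; the path index `i` is 1-based as in the definition.

```python
def no_failure_since(schedules, container, t1, t):
    """True if no agent recorded a failed attestation (-1) of `container` in [t1, t]."""
    return not any(
        result == -1 and c == container and t1 <= t2 <= t
        for events in schedules.values()
        for (t2, c), result in events.items()
    )


def gain_att(schedules, containers, path, i, t):
    """Binary trust gain by attestation for the transition (v_{i-1}, v_i), i >= 2."""
    verifier, target = path[i - 2], path[i - 1]
    for container in containers:
        if target not in container:
            continue
        for (t1, c), result in schedules.get(verifier, {}).items():
            if (c == container and result == 1 and t1 <= t
                    and no_failure_since(schedules, container, t1, t)):
                return 1
    return 0


# Assumed example data: pep attests {pdp, pxp} at t = 1.
PDP_PXP = frozenset({"pdp", "pxp"})
PIP = frozenset({"pip"})
containers = {PDP_PXP, PIP}
schedules = {"pep": {(1, PDP_PXP): 1}}
path = ("pep", "pdp", "pxp")
```

In this example the transition from `pep` to `pdp` has a gain of 1 from `t = 1` onwards, whereas the transition from `pdp` to `pxp` stays at 0 because `pdp` itself never attests a container that includes `pxp`.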


4.1.2 Trust gain by locality 


While it is clear that attesting a remote component increases trust, we also have
to manage the trust gains of local components. If two dependent UC components
are included in the same attestation container, they can communicate securely
without conducting a remote attestation. However, even though a remote
attestation is not required for establishing a secure channel, the integrity of both
components still needs to be verified. Hence we have to demand that a preceding
component on the path attests both of the local components. This concept of trust
gain by locality is specified in Definition 4.1.2.


Definition 4.1.2 (Trust Gain by Locality). Let $(G, \mathcal{C})$ be an
architecture graph and $\mathcal{A} = (att_a)_{a \in A}$ the family of
associated attestation schedules. Let $H$ be a UC activity of $G$ and
$(v_1, \dots, v_n) \in H$ a path of the activity. The trust gain by locality
for the transition $(v_{i-1}, v_i)$ at time $t$ is defined as

$$
\mathrm{Gain}^{loc}(i, t) :=
\begin{cases}
1, & \exists C \in \mathcal{C},\ t_1 \le t :\ \{v_{i-1}, v_i\} \subseteq C \,\wedge\, \exists j < i : att_{v_j}(t_1, C) = 1 \,\wedge\, \forall a \in A : \nexists\, t_2 \in [t_1, t] : att_a(t_2, C) = -1 \\
0, & \text{else}
\end{cases}
$$
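
The trust gain by locality can be evaluated in the same style. This Python sketch is an illustration under assumptions: `schedules` maps each agent to a sparse dictionary `{(t, container): result}`, `containers` is a set of `frozenset` objects, and the index `i` is 1-based as in Definition 4.1.2.

```python
def no_failure_since(schedules, container, t1, t):
    """True if no agent recorded a failed attestation (-1) of `container` in [t1, t]."""
    return not any(
        result == -1 and c == container and t1 <= t2 <= t
        for events in schedules.values()
        for (t2, c), result in events.items()
    )


def gain_loc(schedules, containers, path, i, t):
    """Binary trust gain by locality for the transition (v_{i-1}, v_i), i >= 2."""
    previous, current = path[i - 2], path[i - 1]
    earlier = path[: i - 1]          # v_1, ..., v_{i-1}, i.e. all v_j with j < i
    for container in containers:
        if not {previous, current} <= container:
            continue                 # both endpoints must share the container
        for verifier in earlier:
            for (t1, c), result in schedules.get(verifier, {}).items():
                if (c == container and result == 1 and t1 <= t
                        and no_failure_since(schedules, container, t1, t)):
                    return 1
    return 0


# Assumed example data: pep attests the container {pdp, pxp} at t = 1.
PDP_PXP = frozenset({"pdp", "pxp"})
containers = {PDP_PXP, frozenset({"pip"})}
schedules = {"pep": {(1, PDP_PXP): 1}}
path = ("pep", "pdp", "pxp")
```

Here the transition from `pdp` to `pxp` gains trust at `t = 1` because both agents share a container that the preceding agent `pep` has attested.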


4.1.3 Putting it together 


Given the two concepts of generating trust in a distributed UC system, we can
define the trust level for a UC activity. We can base the definition on the trust
gain by attestation, the trust gain by locality, or both. Definition 4.1.3 specifies
the trust level of a path by multiplying the trust gains of the respective transitions.
The trust level of the whole UC activity is the minimal trust over all paths.


Definition 4.1.3 (Trust Level). Let $(G, \mathcal{C})$ be an architecture graph
and further let $H = (\tilde{A}, \tilde{E}, \tilde{l})$ be a UC activity of $G$
with root $x \in \tilde{A}$ and leaves $Y \subseteq \tilde{A}$. The trust level
of a path $p = (v_1, \dots, v_n) \in H$ is defined as

$$\mathrm{TrustLevel}^{p}(t) := \prod_{i=2}^{n} \mathrm{Gain}(i, t)$$

Depending on the scenario, the trust gain is defined by attestation or by
attestation and locality.

$$\mathrm{Gain}(i, t) := \mathrm{Gain}^{att}(i, t)$$
$$\mathrm{Gain}(i, t) := \max\left(\mathrm{Gain}^{att}(i, t), \mathrm{Gain}^{loc}(i, t)\right)$$

The trust level of the UC activity $H$ is defined as

$$\mathrm{TrustLevel}^{H}(t) := \min_{p \in H} \left(\mathrm{TrustLevel}^{p}(t)\right)$$



Note that the trust level definition is based on the transitions between agents
in the UC activity, instead of the agents themselves. Unlike many existing
reputation systems (cf. Section 2), we do not define the trust level of a certain
agent at all. Instead we define the trust gain of a transition within a UC activity,
and then generalize that definition over paths to the whole activity. The reason
for this is that remote attestation is not only responsible for verifying the integrity
of agents, but also establishes a secure channel for communication. Hence it is
not sufficient to focus just on the level of trust in the agents; we also need to
examine the connections between them.
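
The aggregation over transitions and paths can be sketched as follows in Python. The trust gain is passed in as a callable `gain(path, i, t)` so that the sketch stays independent of how gains are computed; the function names, the lookup table `trusted` and the example edges are assumptions. Only root-to-leaf paths are enumerated, which yields the same minimum as all paths because every gain lies in $[0, 1]$.

```python
from math import prod


def root_to_leaf_paths(edges, root):
    """Enumerate all paths from the root to a leaf of an acyclic activity."""
    successors = {}
    for a, b in edges:
        successors.setdefault(a, []).append(b)

    def walk(path):
        following = successors.get(path[-1], [])
        if not following:
            yield tuple(path)
        for nxt in following:
            yield from walk(path + [nxt])

    return list(walk([root]))


def trust_level(edges, root, gain, t):
    """Minimum over all paths of the product of the transition gains."""
    return min(
        prod(gain(path, i, t) for i in range(2, len(path) + 1))
        for path in root_to_leaf_paths(edges, root)
    )


# Assumed activity and an assumed table of trusted transitions per time step.
edges = [("pep", "pdp"), ("pdp", "pxp"), ("pdp", "pip"), ("pep", "pip")]
trusted = {
    1: {("pep", "pdp"), ("pdp", "pxp")},
    2: {("pep", "pdp"), ("pdp", "pxp"), ("pdp", "pip"), ("pep", "pip")},
}


def table_gain(path, i, t):
    """Stand-in for Gain(i, t): 1 if the transition is listed as trusted at time t."""
    return 1 if (path[i - 2], path[i - 1]) in trusted.get(t, set()) else 0
```

With this table the activity is untrusted at `t = 1`, since two of its three paths still contain an unattested transition, and trusted at `t = 2`.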


4.2 Non-binary trust metrics 


A binary trust metric can only distinguish trusted from untrusted systems. In
order to quantify trust more precisely, we can define non-binary trust metrics.
In that case, given a UC activity $H$, we denote the level of trust in the activity at
time $t$ by $\mathrm{TrustLevel}^H(t) \in [0, 1]$.


A simple non-binary trust metric can be obtained by including the temporal decay
of trust in the model. For this we introduce a dampening factor
$\eta : \mathbb{N}_0 \to [0, 1]$ and modify the definitions of trust gains.


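Definition 4.2.1 (Trust Gains with Temporal Decay).

$$
\mathrm{Gain}^{att}(i, t) :=
\begin{cases}
\eta(t - t_1), & \exists C \in \mathcal{C},\ t_1 \le t :\ v_i \in C \,\wedge\, att_{v_{i-1}}(t_1, C) = 1 \,\wedge\, \forall a \in A : \nexists\, t_2 \in [t_1, t] : att_a(t_2, C) = -1 \\
0, & \text{else}
\end{cases}
$$

$$
\mathrm{Gain}^{loc}(i, t) :=
\begin{cases}
\eta(t - t_1), & \exists C \in \mathcal{C},\ t_1 \le t :\ \{v_{i-1}, v_i\} \subseteq C \,\wedge\, \exists j < i : att_{v_j}(t_1, C) = 1 \,\wedge\, \forall a \in A : \nexists\, t_2 \in [t_1, t] : att_a(t_2, C) = -1 \\
0, & \text{else}
\end{cases}
$$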


The definition of the dampening factor $\eta$ depends on the scenario. In general,
the choice of $\eta$ reflects how fast the generated trust deteriorates after a successful
attestation. For most cases a polynomial or exponential decay should be an
adequate choice.


$$\eta(t) := (t + 1)^{-\lambda}$$
$$\eta(t) := \exp(-\lambda t)$$
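
The two families of dampening factors can be written as small Python factories, as sketched below. The parameter names `exponent` and `rate` are assumptions, and the polynomial form with a free exponent is one plausible reading of a polynomial decay rather than a prescribed choice.

```python
import math


def polynomial_decay(exponent):
    """eta(dt) = (dt + 1) ** (-exponent); the exponent is an assumed parameter."""
    return lambda dt: (dt + 1) ** (-exponent)


def exponential_decay(rate):
    """eta(dt) = exp(-rate * dt)."""
    return lambda dt: math.exp(-rate * dt)


eta_poly = polynomial_decay(2)
eta_exp = exponential_decay(0.1)

# Trust gain of an attestation that is dt time steps old.
poly_values = [eta_poly(dt) for dt in range(4)]
exp_values = [eta_exp(dt) for dt in range(4)]
```

Both factors equal 1 for a fresh attestation (`dt = 0`) and decrease monotonically afterwards, so the decayed gains stay within $[0, 1]$.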


4.3 Example calculation 


After defining binary and non-binary trust metrics, we give an example
calculation based on the previously used International Data Space scenario. For
this, we take the UC activity representing local policy enforcement from Figure
3.4 and define suitable attestation containers. As shown in Figure 4.1, the set
of attestation containers is $\mathcal{C} = \{\{pip\}, \{pdp, pxp\}\}$. Since the
International Data Space uses TPMs to provide proof of integrity during remote
attestation, in this case the attestation containers represent physical computer
systems.


Figure 4.1: UC activity with attestation containers: Local policy enforcement.


In order to determine the trust level of this scenario, we have to specify the 
attestation schedule. We assume that the PXP and the PIP do not conduct any 
attestations in this example. 

$$\forall t \in \mathbb{N}^+, C \in \mathcal{C} : att_{pxp}(t, C) = 0$$
$$\forall t \in \mathbb{N}^+, C \in \mathcal{C} : att_{pip}(t, C) = 0$$

The root PEP attests the PDP and PXP at $t = 1$, while both PEP and PDP attest
the PIP at $t = 2$. Since in this example all conducted attestations are successful,
the attestation schedules never evaluate to $-1$.

$$
att_{pep}(t, C) =
\begin{cases}
1, & t = 1 \wedge C = \{pdp, pxp\} \\
1, & t = 2 \wedge C = \{pip\} \\
0, & \text{else}
\end{cases}
$$

$$
att_{pdp}(t, C) =
\begin{cases}
1, & t = 2 \wedge C = \{pip\} \\
0, & \text{else}
\end{cases}
$$


Furthermore we consider both attestation and locality trust gains for this example 
calculation. Table [4.1] shows the development of the trust level for the three 
paths. 


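t    (pep, pdp, pxp)                        (pep, pdp, pip)                        (pep, pip)
0    0                                      0                                      0
1    $\eta^{att}(0) \cdot \eta^{loc}(0)$    0                                      0
2    $\eta^{att}(1) \cdot \eta^{loc}(1)$    $\eta^{att}(1) \cdot \eta^{att}(0)$    $\eta^{att}(0)$


Table 4.1: Development of trust levels over time.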


At $t = 0$, no attestations have been conducted yet, so the trust level for all paths
is 0. At $t = 1$, the PEP conducts a remote attestation of the attestation container
$\{pdp, pxp\}$. This results in an attestation trust gain of $\eta^{att}(0)$ for the transition
$pep \to pdp$ and a locality trust gain of $\eta^{loc}(0)$ for the transition $pdp \to pxp$. At
$t = 2$, both the PEP and the PDP conduct a remote attestation of the attestation
container $\{pip\}$. Then the transition $pep \to pip$ is directly attested with an
attestation trust gain of $\eta^{att}(0)$. However, the transition $pep \to pdp$ now has an
attestation trust gain of $\eta^{att}(1)$, since the relevant attestation is one time step in
the past. For the same reason the transition $pdp \to pxp$ now has a locality trust
gain of $\eta^{loc}(1)$.
If we assume the dampening factors of the trust gains to be
$\eta^{att}(t) := \exp(-\tfrac{1}{10} t)$ and $\eta^{loc}(t) := \exp(-\tfrac{1}{15} t)$,
the trust level of the entire activity $H$ at time $t = 2$ results in

$$
\begin{aligned}
\mathrm{TrustLevel}^H(2) &= \min\left(\eta^{att}(1) \cdot \eta^{loc}(1),\ \eta^{att}(1) \cdot \eta^{att}(0),\ \eta^{att}(0)\right) \\
&= \min\left(\eta^{att}(1) \cdot \eta^{loc}(1),\ \eta^{att}(1) \cdot 1,\ 1\right) \\
&= \eta^{att}(1) \cdot \eta^{loc}(1) \\
&\approx 0.846
\end{aligned}
$$
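
The example calculation can be replayed with a short Python script. This is a sketch under assumptions: the decay rates 1/10 (attestation) and 1/15 (locality) are assumed values that reproduce the reported result of 0.846, the checks for failed attestations are omitted because none occur in this scenario, and when several attestations apply the most favorable gain is used.

```python
import math

# Successful attestations of the example: (verifier, time, container).
ATTESTATIONS = [
    ("pep", 1, frozenset({"pdp", "pxp"})),
    ("pep", 2, frozenset({"pip"})),
    ("pdp", 2, frozenset({"pip"})),
]
PATHS = [("pep", "pdp", "pxp"), ("pep", "pdp", "pip"), ("pep", "pip")]


def eta_att(dt):
    return math.exp(-dt / 10)   # assumed rate 1/10


def eta_loc(dt):
    return math.exp(-dt / 15)   # assumed rate 1/15


def gain(path, i, t):
    """max(Gain^att, Gain^loc) with temporal decay for the transition (v_{i-1}, v_i)."""
    previous, current = path[i - 2], path[i - 1]
    best = 0.0
    for verifier, t1, container in ATTESTATIONS:
        if t1 > t:
            continue
        if current in container and verifier == previous:
            best = max(best, eta_att(t - t1))        # trust gain by attestation
        if {previous, current} <= container and verifier in path[: i - 1]:
            best = max(best, eta_loc(t - t1))        # trust gain by locality
    return best


def trust_level(t):
    """Minimum over the three paths of the product of the transition gains."""
    return min(
        math.prod(gain(path, i, t) for i in range(2, len(path) + 1))
        for path in PATHS
    )


levels = {t: round(trust_level(t), 3) for t in range(3)}
```

The path via `pdp` to `pxp` limits the trust level at `t = 2`, because both of its transitions rely on the attestation conducted at `t = 1`.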


5 Conclusion 


In this work we developed a formal model for quantifying trust in distributed usage 
control systems. After defining the relevant trust dependencies and interacting 
agents, we developed binary and non-binary trust metrics that quantify the level 
of trust reached in a certain scenario. While successful attestations positively 
influence the trust, failed attestations and time progression reduce the reached 
overall trust level. Finally we showed an example calculation based on the real 
distributed usage control system that is deployed in the International Data Space. 


Possible future work includes investigating how Dempster-Shafer theory
could be applied to the formal model. With Dempster-Shafer theory it is possible
to model unawareness and uncertainty of knowledge. It is also helpful in
combining degrees of belief from different sources, which makes it promising
for representing trust in distributed systems. Reputation systems based on
Dempster-Shafer theory already exist [12].


Another important direction is to evaluate to what extent the assumptions made by
the formal model hold in practice. The presented trust metric is only meaningful
if the used remote attestation protocol guarantees integrity verification and
secure communication across the distributed system. However, especially for
the widespread TPMs this assumption does not hold in all scenarios [11]. A
more subtle problem that occurs in practice is the availability of UC components.
Even if the used remote attestation protocol is secure, one can never prevent
a malicious operator from deliberately severing communications between local and
remote usage control components. In this case it is important that the roots of all
affected UC activities are notified about the loss of communication; otherwise
the security of the usage control system may be compromised. Even though the
formal model cannot directly monitor this, being able to identify relevant UC
activities and their trust dependencies is a substantial help in auditing distributed
usage control systems for these weaknesses.


Abstract


3D data contain rich information about the full geometry of objects or scenes.
Learning tasks on them have always been considered hard in the computer
vision community due to their extremely high dimensionality. Hence, latent
representations of 3D geometries are often used to lower the data dimensionality
for better parameterization and easier computation. In this report, we briefly
review latent representations obtained via different methods, including
classical ones and the emerging neural learning-based ones. Furthermore, the
now widely used deep learning methods are investigated more closely
regarding their applications to various 3D data formats. The
possibility of combining those two kinds of methods is also addressed.


1 Introduction 


3D data analysis has always been an interesting yet challenging research topic
for computer vision researchers. Learning latent information from such data is vital
to many advanced technology applications, including robotics, autonomous
driving, virtual reality and augmented reality. Many classical methods have
been proposed to extract latent representations from 3D data. Those latent
representations can be images, graphs, histograms, or even vectors [4]. Classical
methods usually focus more on generating latent representations of 3D shapes.
Those generated latent representations are sometimes also referred to as shape
descriptors.


In recent years, neural networks have proven to be one of the most powerful
learning algorithms for computer vision tasks, especially on 2D Euclidean data.
Implicitly learned feature maps or bottleneck feature vectors have been used for
classification, detection, or segmentation tasks. Later on, similar methods were
proposed for 3D Euclidean data with minor adaptations. However, those
learning algorithms cannot be straightforwardly extended to Non-Euclidean
data due to their non-grid data structure. Special neural network
architectures for 3D Non-Euclidean data have therefore been carefully
designed, with input, output, latent representations, and even
network operations defined specifically for such data.


This report is structured as follows. In Section 2, we briefly review the most 
common 3D data formats. Latent representations learned by classical methods 
or neural learning-based methods are reviewed in Section 3. Section 4 gives
a more detailed review of the application of deep learning a) for ML tasks on
3D data and b) for the generation of latent representations that can be used
by different methods later on. A conclusion and an outlook are presented in
Section 5.


2 Overview of 3D data format 


3D data come in many different formats depending on their source. They are usually 
categorized into two groups: Euclidean data, which mainly include multi-view 
images, RGB-D images, volumetric voxels, and octrees; and Non-Euclidean data, 
which mainly include point clouds and meshes. Euclidean data are usually 
rasterized and lie on regular grids. For example, images are composed 
of pixels that are well aligned and always have the same number of neighbours. 
Non-Euclidean data are usually geometric in form and do not have 
regular grids. For example, under a geometric metric, the distance between 
two vertices on a mesh should be computed as their geodesic distance on the 
manifold rather than as the direct Euclidean distance. In this section, different 3D 
data formats are briefly reviewed and compared. 
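

The difference between the two distance notions can be illustrated with a 
minimal sketch on a unit sphere, where the geodesic distance has a closed form 
(the great-circle distance). The function names and the choice of a sphere are 
ours and serve only as an illustration; on a general mesh the geodesic distance 
has to be approximated numerically. 

```python
import numpy as np


def euclidean_distance(p, q):
    """Straight-line (chord) distance through the ambient 3D space."""
    return float(np.linalg.norm(np.asarray(p, dtype=float) - np.asarray(q, dtype=float)))


def sphere_geodesic_distance(p, q, radius=1.0):
    """Great-circle distance between two points on a sphere centred at the origin."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Clipping guards against rounding errors pushing the cosine outside [-1, 1].
    cos_angle = np.clip(np.dot(p, q) / radius**2, -1.0, 1.0)
    return float(radius * np.arccos(cos_angle))


north_pole = (0.0, 0.0, 1.0)
south_pole = (0.0, 0.0, -1.0)
chord = euclidean_distance(north_pole, south_pole)      # cuts through the sphere
arc = sphere_geodesic_distance(north_pole, south_pole)  # stays on the surface
```

For the two poles, the chord has length 2 while the path along the surface has 
length pi, so a method that uses the Euclidean distance underestimates how far 
apart the points are on the manifold. 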


2.1 Euclidean data 


Multi-view images: 3D data may be presented as a combination of multiple 2D 
images of the 3D object captured from different viewpoints [38]. When learning with 
this format, the noise caused by incompleteness, occlusion, and illumination 
problems can be reduced considerably. All the input views jointly optimize the functions 
that represent the whole 3D shape. However, this format requires many input 
sources and is often too expensive for industrial use. The question of how 
many views are sufficient to represent a shape also remains open. 


RGB-D images: With the development of RGB-D sensors such as the Microsoft 
Kinect, more and more industrial applications use RGB-D images as 
the input data format for their tasks. This data format provides an additional 
depth map along with the usual 2D RGB color information. Compared to 
other 3D data formats, more RGB-D data are available because they are 
inexpensive to acquire [7]. 
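

As an illustration of how a depth map relates to the other formats, the 
following sketch back-projects a depth image into a point cloud with the 
standard pinhole camera model. The intrinsic parameters `fx`, `fy`, `cx`, `cy` 
and the convention that a depth of 0 marks a missing measurement are 
assumptions made for this example; real sensors document their own calibration 
and conventions. 

```python
import numpy as np


def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map into an (N, 3) array of camera-space points.

    Pixels with depth 0 are treated as missing (an assumption of this sketch).
    """
    height, width = depth.shape
    v, u = np.indices((height, width))  # v: row index, u: column index
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=1)


# Toy 2x2 depth map with one missing pixel and made-up intrinsics.
depth = np.array([[1.0, 0.0],
                  [2.0, 2.0]])
points = depth_to_points(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

The resulting array contains one 3D point per valid pixel, which is one common 
way to obtain a point cloud from RGB-D input. 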


Volumetric data: Just as 2D shapes can be rasterized into pixels, 3D shapes 
can be rasterized into voxels. In this case, 3D shapes are encoded by the 
occupied voxels. Despite its simplicity, the voxel-based representation 
struggles to preserve the intrinsic properties of 3D shapes and the smoothness 
of their surfaces [34]. It also requires a large amount of memory and has high 
computational complexity, which makes the volumetric format unsuitable for 
high-resolution data. 
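

A minimal sketch of such a rasterization is given below. It converts a point 
set into a binary occupancy grid; the function name `voxelize`, the uniform 
scaling by the largest bounding-box extent, and the random test cloud are our 
own illustrative choices. 

```python
import numpy as np


def voxelize(points, resolution):
    """Binary occupancy grid of shape (R, R, R) from an (N, 3) point array."""
    points = np.asarray(points, dtype=float)
    lower = points.min(axis=0)
    extent = (points.max(axis=0) - lower).max()
    extent = extent if extent > 0 else 1.0  # avoid division by zero for a single point
    # Map coordinates to integer cell indices in [0, resolution - 1].
    indices = np.floor((points - lower) / extent * resolution).astype(int)
    indices = np.clip(indices, 0, resolution - 1)
    grid = np.zeros((resolution,) * 3, dtype=bool)
    grid[indices[:, 0], indices[:, 1], indices[:, 2]] = True
    return grid


rng = np.random.default_rng(0)
cloud = rng.random((1000, 3))
coarse = voxelize(cloud, resolution=8)
fine = voxelize(cloud, resolution=16)
```

Doubling the resolution multiplies the number of cells by eight 
(`fine.size == 8 * coarse.size`), which illustrates the cubic growth in memory 
mentioned above. 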


2.2 Non-Euclidean data 


Point clouds: A point cloud is a set of unstructured points that approximate 
the geometry of an object. However, if we only consider the local structure of 
the object, local subsets of points may also be considered Euclidean, since they have 
a global parameterization and are usually represented in an ordinary system of 
coordinates. The classification depends on the metric that is used. Most tasks, 
however, focus on the global structure for shape recognition, matching, or retrieval, 
hence point clouds are classified as a Non-Euclidean data format in most 
cases. Nowadays there is a wide choice of 3D sensors for generating point 
clouds, e.g., Ensenso or Zivid; they usually work in a single shot and capture the whole 
scene. Therefore, unlike for other formats, preprocessing steps such as 
noise filtering or scene segmentation are usually required for point clouds of 3D 
shapes. 


Meshes/graphs: A polygon mesh is a collection of vertices, edges and faces 
that defines the shape of a polyhedral object in 3D computer graphics and solid 
modeling. With an appropriate number of vertices, meshes can give extremely 
accurate geometric information about 3D shapes. The vertices in a mesh are 
connected in a fixed way, which makes a mesh a special case of a graph. The process 
of generating an approximate watertight mesh from a randomly connected graph 
is called 3D shape completion or inpainting. Although meshes contain rich 
information about 3D shapes, learning on them directly is challenging 
due to their irregularity. In most related work, the spectral properties of the 
graphs and meshes are utilized to learn latent features after applying a graph 
Laplacian eigen-decomposition. 


Continuous space function: Continuous space functions are a very special 
data format. They use a mathematical function to represent the 3D shape directly 
and precisely. With minor changes to the definition, this format is also referred 
to as a level set or signed distance function (SDF). Given a coordinate in the 
defined space as input, an SDF outputs a value whose sign (positive or negative) 
denotes whether the point is outside or inside the shape boundary. For example, 
if the output range of an SDF is defined as $[-1, 1]$, the whole function may be 
considered a mapping $f : \mathbb{R}^3 \to [-1, 1]$. If 0 is defined as the cutoff 
boundary, then all points whose coordinates yield an output in $[-1, 0)$ lie 
inside the object surface, and those with an output in $(0, 1]$ lie outside. 
However, only simple shapes such as a cube, heart, donut, or lemon can be easily 
described with an SDF. For even slightly complex shapes it is usually impossible 
to find such a function in closed form. Thus this data format is less explored 
than the others. 
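

For one of the simple shapes, a sphere, the SDF can be written down directly. 
The sketch below is an illustration under our own assumptions: the function 
name, the default radius, and the clipping to $[-1, 1]$ are chosen to match the 
example above and are not taken from a specific method. 

```python
import numpy as np


def sphere_sdf(points, center=(0.0, 0.0, 0.0), radius=0.5):
    """Signed distance to a sphere: negative inside, zero on the surface, positive outside."""
    points = np.atleast_2d(np.asarray(points, dtype=float))
    distance_to_center = np.linalg.norm(points - np.asarray(center, dtype=float), axis=1)
    # Clip to [-1, 1] so the output range matches the mapping f : R^3 -> [-1, 1].
    return np.clip(distance_to_center - radius, -1.0, 1.0)


queries = np.array([[0.0, 0.0, 0.0],   # centre of the sphere
                    [0.5, 0.0, 0.0],   # on the surface
                    [1.0, 0.0, 0.0]])  # outside
values = sphere_sdf(queries)
inside = values < 0
```

The three query points yield a negative value, zero, and a positive value, 
respectively, so the sign alone separates the interior from the exterior. 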
Table 2.1: Property comparison of different 3D data formats 


2.3 Property comparison 


It is impossible to name a single best 3D data format. Apart 
from the accuracy requirements, to make better use of the 3D information, it is 
usually expected that the data should be geometrically manipulable (deformation, 
interpolation, etc.) and convenient for imposing structural constraints. On the other 
hand, since we are interested in applying deep learning algorithms to them, the 
data should also be easy to formulate as the input/output of neural 
networks and allow fast forward/backward propagation. 
Here, based on the state of the art, we summarize a subjective overall rating 
of different 3D data formats with respect to these properties in Table 2.1. In 
most cases, people simply use the most appropriate data format for their tasks 
according to the limitations of the input source, the available computing power, 
and the accuracy and robustness requirements. 


3 Latent representations of 3D data 


The process of acquiring latent representations from input data is essentially a 
mapping process. It maps the input data from their original data space to a 
latent space, which is usually lower dimensional. In statistics, latent 
representations (or latent variables) are variables that are not directly observed 
but are rather inferred through a mathematical model from other variables that 
are directly observed and measured. Although multi-view images or volumetric 
data may be regarded as the result of a special mapping that takes the original 
geometric data into a lower dimensional space, those data representations are 
usually not considered latent, since shape properties can still be observed 
directly on them. Hence, in this report, we regard them as data 
formats and not as latent representations. 


Before the recent upsurge of deep learning, there were already many 
classical mathematical methods that try to encode 3D data, mostly 3D shapes. 
For 3D shapes, the latent representations are also called shape 
descriptors. In this section, we first give a brief overview of those classical 
methods and the shape descriptors they generate, then examine the latent 
representations learned with neural networks. 


3.1 Classical methods 


Classical methods usually have a strong mathematical foundation, involving 
strict mathematical formulas and derivations. Therefore their encoding results 
are usually deterministic. There are numerous classical methods that 
try to learn latent representations from 3D data, whether in Euclidean 
or Non-Euclidean formats. Here we list only some of the best-known 
ones that may be related or helpful to our future work. 


Ray-based sampling with spherical harmonics: In order to characterize 
shapes of functions on a sphere by just a few parameters, spherical harmonics 
pi were proposed as a suitable tool. The magnitudes of complex coefficients, 
which are obtained by applying the fast Fourier transform on the sphere to the 
samples, are regarded as vector components. Thus, the ray-based feature vector 
is represented in the spectral domain, where each vector component is formed 
by taking into account all original input. 


Laplacian spectral eigenvectors: In addition to considering the connectivity 
of nodes and edges in a graph, mesh Laplacian operators take into account the 
geometry of a surface (e.g. the angles at the nodes). For a manifold triangle 
mesh, the Laplace-Beltrami operator is used to represent the intrinsic geometric 
structure. After applying the Laplacian eigen-decomposition, the original shape 
may be represented by its spectral eigenvectors, which makes mesh processing 
and surface editing [25] possible. 
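

The eigen-decomposition step can be sketched as follows. As a simplification, 
the example uses the purely combinatorial graph Laplacian L = D - A built from 
the mesh connectivity, not a geometry-aware discretization of the 
Laplace-Beltrami operator such as cotangent weights; the function name and the 
tetrahedron test mesh are our own choices. 

```python
import numpy as np


def graph_laplacian(num_vertices, faces):
    """Combinatorial Laplacian L = D - A built from the edges of triangle faces."""
    adjacency = np.zeros((num_vertices, num_vertices))
    for a, b, c in faces:
        for i, j in ((a, b), (b, c), (c, a)):
            adjacency[i, j] = adjacency[j, i] = 1.0
    degree = np.diag(adjacency.sum(axis=1))
    return degree - adjacency


# A tetrahedron: four vertices, four triangular faces.
faces = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
laplacian = graph_laplacian(4, faces)

# eigh is appropriate because the Laplacian is symmetric; eigenvalues are ascending.
eigenvalues, eigenvectors = np.linalg.eigh(laplacian)
```

The columns of `eigenvectors` form an orthonormal basis in which functions on 
the vertices can be expressed, analogous to a Fourier basis on a regular grid. 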


Heat kernel signature: A heat kernel signature (HKS) is a shape descriptor 
obtained via spectral shape analysis and used for deformable shape 
analysis. It is based on the heat kernel, which is a fundamental solution to the 
heat equation [27]. For each point on the shape, HKS defines a feature vector 
representing the point's local and global geometric properties. HKS is one 
of many recently introduced shape descriptors that are based on the 
Laplace-Beltrami operator associated with the shape. Other related 
shape descriptors include the global point signature (GPS), the biharmonic signature 
(BS), and the wave kernel signature (WKS). 
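

In its usual discrete form, the HKS of a point x at time t is the diagonal of 
the heat kernel, HKS(x, t) = sum_i exp(-lambda_i t) phi_i(x)^2, where lambda_i 
and phi_i are the eigenvalues and eigenvectors of the Laplacian. The following 
sketch evaluates this sum; as an assumption for illustration, a small graph 
Laplacian stands in for the Laplace-Beltrami operator of a real shape, and the 
time samples are arbitrary. 

```python
import numpy as np


def heat_kernel_signature(eigenvalues, eigenvectors, times):
    """HKS(x, t) = sum_i exp(-lambda_i * t) * phi_i(x)**2, returned with shape (V, T)."""
    eigenvalues = np.asarray(eigenvalues, dtype=float)
    times = np.asarray(times, dtype=float)
    decay = np.exp(-np.outer(eigenvalues, times))           # shape (K, T)
    return (np.asarray(eigenvectors, dtype=float) ** 2) @ decay  # (V, K) @ (K, T)


# Path graph with three vertices as a toy stand-in for a discretized shape.
laplacian = np.array([[1.0, -1.0, 0.0],
                      [-1.0, 2.0, -1.0],
                      [0.0, -1.0, 1.0]])
lam, phi = np.linalg.eigh(laplacian)
hks = heat_kernel_signature(lam, phi, times=[0.1, 1.0, 10.0])
```

Each row of `hks` is the descriptor of one vertex. The two end vertices of the 
path receive identical rows because they are related by a symmetry of the 
graph, while the middle vertex receives a different one. 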


Skeleton-based 3D descriptor: Skeletons derived from solid objects can be 
regarded as intuitive object descriptions. They are able to capture the most 
important information about the shape structure. Sundar et al. presented 
a framework for skeletonization and 3D object retrieval. Skeleton-based 3D 
descriptors are widely used in the animation and film industries nowadays because 
they offer well-parameterized control over the shape joints. 


Primitive-based CAD model descriptor: 3D shapes may be approximately 
assembled by composing simple volumetric primitives including cuboids, cylin- 
ders and spheres. The shapes from one category usually have similar primitive 
representations. Using this abstract representation, interpolation between the 
obtained latent representations may provide a consistent parsing across shapes 
in one certain category. 


3.2 Neural learning-based methods 


Compared to the classical methods, neural learning-based methods are less 
deterministic since they involve more stochastic computations. The final 
parameters of a trained neural network may differ slightly between training 
runs even though all the settings are identical. 


Latent representations learned via neural networks are seldom of 
particular concern in most computer vision tasks, although they have always been 
implicitly used. A good example is the bottleneck features in transfer 
learning. In transfer learning, we take a pre-trained model, including network and 
weights, then remove the last few fully connected (FC) layers and construct 
our own in their place. When training starts on the new data set, the 
original network parameters before the FC layers are usually frozen and only the newly 
added FC layers are trained. Here the input to the FC layers is referred to 
as bottleneck features. They represent the latent features learned by the last 
convolution layer in the network. We could of course take the feature maps from any 
previous layer and call them bottleneck features or latent representations, but 
in most cases we are more interested in a vector representation, so a flattened 
bottleneck feature vector is used more often. Still, the properties 
of bottleneck features themselves remain largely unexplored. 


Figure 3.1: The basic structure of a neural encoder-decoder. The feature maps/vectors learned 
inside the network may be regarded as latent representations. 


In generation tasks, latent representations are also crucial to learning. A 
typical generative adversarial network (GAN) may take a vector from the latent 
space as input to generate pseudo-real-world data. When interpolating between the 
input latent vectors, a continuous reshaping or deformation of the output can usually be 
observed. 
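

The interpolation itself is simple to state. The sketch below produces a 
sequence of latent vectors on the straight line between two endpoints; linear 
interpolation and the function name are our illustrative choices, and the 
trained generator that would map each vector to a shape is not shown. 

```python
import numpy as np


def interpolate_latents(z_start, z_end, num_steps):
    """Linear interpolation between two latent vectors, endpoints included."""
    z_start = np.asarray(z_start, dtype=float)
    z_end = np.asarray(z_end, dtype=float)
    weights = np.linspace(0.0, 1.0, num_steps)[:, None]  # shape (num_steps, 1)
    return (1.0 - weights) * z_start + weights * z_end


# Two toy 2D latent vectors; real latent spaces are much higher dimensional.
path = interpolate_latents([0.0, 2.0], [1.0, 0.0], num_steps=5)
# Each row of `path` would be passed to a trained generator (hypothetical here).
```

Feeding the rows of `path` to a generator one after another is what yields the 
gradual deformation from the first output to the second. 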


Figure 3.1 gives a brief idea of how latent representations are learned within a 
neural encoder-decoder. A more detailed survey of how latent representations 
of 3D shapes are obtained and utilized with deep learning methods is given in 
the next section. 


4 Deep learning on 3D data 


4.1 Learning on 3D Euclidean data 


In order to replicate the success of deep learning techniques from the 2D domain 
in the 3D domain, the most obvious approach is to use 3D Euclidean data directly 
for learning. If only 3D Non-Euclidean data are provided, 
we can always convert them into Euclidean formats at the cost of a certain information 
loss. Due to its simplicity and convenience, this conversion has been 
widely used to create rasterized data that fit Euclidean neural network 
architectures ever since the emergence of deep learning, and it is still in use today. 


4.1.1 Image-based representations 


When using RGB-D images or multi-view images as the input for deep learning 
tasks, it is often required to have multiple input channels or even multiple CNN 
streams to process the data. For example, [5] used a two-stream CNN on 
RGB-D data for 3D object recognition tasks. The latent features learned by the 
two streams were fused in a later FC layer and the classification 
result was given after a further softmax layer. A more interesting method was 
proposed in [2], in which the idea of transfer learning was combined with the 
method used in 6}. It used four separate CNNs to train the four channels of the 
RGB-D data, while the weights were transferred from each network to another. 
Their results indicated that the depth information carries valuable information 
about shapes. 


More processing streams are needed for the multi-view image format. 
MVCNN processed 12 rendered views of a 3D object separately. Then a 
max pooling operation was applied in the view-pooling layer to get a compact 
latent representation of the whole shape. In [37], a multi-branch CNN was 
designed to use rendered depth maps from different views of the object as input. 
Each branch returned a feature vector that contributed to the final classification. 
Apart from recognition/classification tasks with a single-value output, this format has 
also been used for more complex tasks. Kalogerakis et al. designed a 
neural network for segmenting 3D objects into their labeled semantic parts by 
learning from their multiple 2D projections. Local shape descriptors from part 
correspondences have also been learned with a multi-view convolutional network 
[10]. Even 3D shape reconstruction from sketches via multi-view convolutional 
networks has been studied in [13]. 
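

The view-pooling step described above can be sketched as an element-wise 
maximum over per-view feature vectors. The feature dimension of 16 and the 
random features are placeholders for the output of a per-view CNN, which is not 
modelled here. 

```python
import numpy as np


def view_pool(view_features):
    """Element-wise max over the view axis: (num_views, D) -> (D,)."""
    return np.asarray(view_features).max(axis=0)


rng = np.random.default_rng(0)
features = rng.normal(size=(12, 16))  # 12 views, toy 16-dimensional descriptors
shape_descriptor = view_pool(features)

# Pooling with max does not depend on the order in which the views are given.
shuffled_descriptor = view_pool(features[rng.permutation(12)])
```

Because the maximum is taken independently per feature dimension, the pooled 
descriptor has the same size as a single view descriptor regardless of how 
many views are used. 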
4.1.2 Volumetric data 


Regular 2D convolution operations have been naturally extended to 3D 
convolution operations by applying 4D convolutional kernels, and corresponding network 
architectures have been proposed. VoxNet [15] first converted the point 
clouds of shapes into voxels according to their occupancy in space. This 
volumetric data was then used as input to a neural network for shape classification. 
A similar method was proposed in 3DShapeNets, except that the 
volumetric data were obtained from depth maps. As a follow-up work, Sedaghat et al. 
modified the architecture of VoxNet by incorporating the orientation of 3D 
objects in the learning process. 


Regarding synthesis tasks with 3D volumetric data, volumetric generative 
adversarial networks were proposed in [32] by extending the idea of GANs from 
the 2D domain. In the McRecon network structure [8], foreground masks 
were used as weak supervision through a raytrace pooling layer for 3D 
reconstruction. There are also octree-based methods, including OctNet [22] and 
O-CNN 9], which consider only the occupied grid cells and are therefore more 
memory efficient. 


4.2 Learning directly on 3D Non-Euclidean data 


As mentioned in the last subsection, 3D Non-Euclidean data can always be 
converted to Euclidean formats for convenient neural network architecture design, 
since similar methods in the 2D domain are already quite 
mature. However, object information is inevitably lost during the conversion. 
The best way to prevent this information loss is to learn directly on 3D 
Non-Euclidean data, which usually requires special ways to define the input, 
output, or even the operations used in the networks. 


4.2.1 Point clouds 


The first deep learning-based method to use 3D point cloud data directly 
for shape analysis tasks was PointNet [20]. It used the (x, y, z) coordinates 
of points as input to the network, and an additional spatial transformer network 
was applied as a pre-processing step. After that, several weight-sharing 
fully connected layers were added to compute point-wise features. Finally, a 
max-pooling layer was used to aggregate the global information and output a 
1024-dimensional latent feature vector for classification tasks. For segmentation 
tasks, the global shape feature and the point-wise features were concatenated 
to predict the point-wise segmentation result. Despite the competitive results 
achieved by PointNet, it still failed to take full advantage of the local features 
in point clouds. The subsequent work PointNet++ tried to address this 
issue by grouping the points at different scales and applying PointNet to each 
group separately in order to aggregate features at different scales. To better aggregate the 
information in the true local neighbourhood, aggregation operations similar to 
convolution have also been proposed, such as EdgeConv and X-Conv, the latter 
defined in [12]. Both take a certain number of neighbours of each 
point into consideration and perform the aggregation point-wise. 
With these operations, the final learned latent representation also implicitly 
contains local information. 
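

The core idea of shared point-wise layers followed by max pooling can be 
sketched in a few lines. This is a simplified illustration and not the 
published architecture: the spatial transformer network is omitted, the layer 
widths (3, 64, 1024) are chosen only so that the output has 1024 dimensions, 
and the weights are random instead of trained. 

```python
import numpy as np


def shared_mlp(points, weights, biases):
    """Apply the same dense layers (with ReLU) to every point independently."""
    hidden = points
    for weight, bias in zip(weights, biases):
        hidden = np.maximum(hidden @ weight + bias, 0.0)
    return hidden


def global_feature(points, weights, biases):
    """Point-wise features followed by max pooling over the point axis."""
    return shared_mlp(points, weights, biases).max(axis=0)


rng = np.random.default_rng(0)
layer_dims = [3, 64, 1024]  # assumed widths for illustration
weights = [rng.normal(scale=0.1, size=(d_in, d_out))
           for d_in, d_out in zip(layer_dims[:-1], layer_dims[1:])]
biases = [np.zeros(d_out) for d_out in layer_dims[1:]]

cloud = rng.normal(size=(256, 3))
feature = global_feature(cloud, weights, biases)
feature_shuffled = global_feature(cloud[rng.permutation(256)], weights, biases)
```

Since every point passes through the same layers and the maximum ignores 
ordering, `feature` and `feature_shuffled` agree, which is the property that 
makes this design suitable for unordered point sets. 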


In the field of 3D point cloud synthesis, [1] proposed a deep autoencoder (AE) with 
high reconstruction quality and generalization. Generative adversarial networks 
(GANs) and Gaussian mixture models (GMMs) were also trained in the 
latent space of their AEs. Similarly, FoldingNet proposed a 
point cloud autoencoder based on deep grid deformation with graph-based encoders, 
in which special perceptron layers were defined as folding operations. For 
the task of upsampling sparse point clouds, PU-Net was designed 
with convolution operations defined in the latent feature space [35]. 


4.2.2 Meshes 


At first glance, triangular meshes give the impression that 2D convolutional 
kernels may be applied directly. However, these rasterized kernels are only 
applicable to Euclidean data because they rely on the shift invariance of the grid structure. 
In order to perform convolution locally, appropriate local patches need to be 
defined. Geodesic CNN (GCNN) constructed local patches in local polar 
coordinates to make them independent of position. Values of the 
functions around each vertex in the mesh are mapped into local polar coordinates 
using the patch operator, so that geodesic convolution may be applied to those 
patches. Later on, Anisotropic CNN (ACNN) was proposed to tackle the 
limitations of GCNN. It constructed a simpler pattern of local patches, which 
is independent of the injectivity radius of the mesh. Rather than using a fixed 
kernel pattern as in GCNN and ACNN, MoNet was proposed to define a 
vertex-wise locally weighted coordinate system, on which parametric kernels 
were applied to define the weighting functions. With this definition, GCNN and 
ACNN may be considered special cases of MoNet with certain constraints. 


Besides the methods defined in the spatial domain, methods defined in 
the spectral domain have also been proposed. For example, [6] first computed 
heat kernel descriptors of shapes based on their heat kernel signatures (HKS); 
the descriptors were then fed into two neural networks whose target values were 
given by the Eigen-shape Descriptor and the Fisher-shape Descriptor, respectively. The final 
deep shape descriptor was formed by concatenating nodes in the hidden layers. 
A similar pipeline was proposed with local point signature (LPS) features. 
Multi-scale vertex spectral images were generated by packing the 16-dimensional 
LPS in a compact manner and then fed into a CNN to generate the final shape 
descriptor. These methods show that shape properties obtained 
via classical methods may be further exploited by deep learning methods 
to obtain a better latent representation, with which better performance on different 
tasks may be achieved. 


4.2.3 Continuous space function 


Continuous space function (CSF) or signed distance function (SDF) is a 
relatively unexplored data format. Although it provides high accuracy, it is usually 
hard to find a function that matches even a slightly complex object. 
Fortunately, neural networks are "universal approximators" and can mimic any 
continuous function to the degree that the network size permits. 


Early this year, DeepSDF was proposed to learn a continuous SDF 
representation of a 3D shape, which encodes the shape's boundary as the zero-level set 
of the learned function and thereby explicitly divides the space into shape interior and 
shape exterior. Deep Level Sets used a similar idea to represent 
the output as an oriented level set of a continuous embedding function with the 
help of deep neural networks. In a more recent paper, Mescheder et al. 
proposed Occupancy Networks, which also use a network to mimic functions 
that define the shape boundaries. An interesting adaptation in their method is that, 
rather than a signed value, the output of the network is a real value between 
0 and 1, which indicates the probability that a certain point in 
space is occupied. Although all those methods usually need a post-processing step 
to visualize the shapes, their reconstruction performance is usually 
qualitatively better than that of classical methods that only work on 
point clouds or meshes. 
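

The post-processing step can be illustrated by evaluating an occupancy 
function on a regular grid and thresholding the result. In the sketch below, a 
hand-written function for a sphere stands in for a trained network; the 
function names, the sharpness parameter, the grid range [-1, 1]^3, and the 
threshold of 0.5 are assumptions made for this example. A surface mesh would 
typically be extracted from such a grid afterwards, for example with marching 
cubes, which is not shown here. 

```python
import numpy as np


def toy_occupancy(points, radius=0.5, sharpness=20.0):
    """Stand-in for a trained network: occupancy probability in (0, 1) for a sphere."""
    signed_distance = np.linalg.norm(points, axis=-1) - radius
    # Logistic function of the negative signed distance: close to 1 inside, 0 outside.
    return 1.0 / (1.0 + np.exp(sharpness * signed_distance))


def occupancy_grid(occupancy_fn, resolution, threshold=0.5):
    """Evaluate occupancy_fn on a regular grid over [-1, 1]^3 and threshold it."""
    axis = np.linspace(-1.0, 1.0, resolution)
    grid_points = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)
    return occupancy_fn(grid_points) > threshold


occupied = occupancy_grid(toy_occupancy, resolution=33)
```

Because the function can be queried at any coordinate, the grid resolution is 
a choice made at visualization time and is not fixed by the representation 
itself. 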


5 Conclusion 


In this report, we first briefly reviewed the most widely used 3D data formats, 
including both the Euclidean and the Non-Euclidean ones. Secondly, latent representations 
or shape descriptors obtained via classical methods and deep neural networks 
were reviewed and discussed. While several classical methods were 
covered, more effort was put into investigating the neural learning-based 
methods. Latent representations of different 3D data formats learned with 
various network architectures were reviewed and discussed, and the possibility 
of combining classical methods and neural learning-based methods was 
given special attention. Although the dominant deep learning 
approaches used for various computer vision tasks nowadays are still 
usually based on images or other Euclidean data, we hope that with a better 
understanding of the latent representations of 3D shapes, more 
efficient architectures may be proposed and better performance may be achieved 
with them in the future. 